Fraud Detection in Credit Card Transactions¶
Introduction¶
In recent years, digital payment methods have become the backbone of the global economy, revolutionizing how individuals, businesses, and governments transfer value. Credit and debit cards now facilitate billions of transactions every single day, powering everything from online shopping and contactless payments at physical stores to subscription-based platforms like Netflix or Spotify.
Their integration into mobile wallets, wearable technology, and global e-commerce ecosystems has made them not just convenient but essential to modern life. As digital infrastructures expand, card-based payments continue to replace cash at an accelerating pace, becoming the default method of exchange for billions of people.
However, this rapid expansion does not come without risks. According to data from the Nilson Report (Issue 1276, 2023), global card fraud losses surpassed $34 billion in 2022, and are projected to exceed $49 billion by 2030. Other industry estimates suggest that a fraudulent transaction takes place approximately every 16 seconds somewhere around the world. These disturbing figures highlight the scale and persistence of the threat.
But how is it done?
Usually, cybercriminals and fraudsters leverage increasingly sophisticated methods, such as:
Card-not-present (CNP) fraud, which is common in online transactions
Phishing and social engineering to steal credentials
Malware and data breaches that expose millions of card numbers
Synthetic identity fraud, where fake identities are constructed to open credit accounts
These techniques have proven highly effective, allowing fraudsters to bypass traditional security systems, exploit digital payment infrastructures, and cause financial harm on a massive scale. It is also important to note that these attacks do not only result in direct monetary losses for consumers and financial institutions - they also undermine trust in digital commerce.
So, we know we are dealing with a genuinely concerning problem, but now we must ask the question - what ideas have we come up with to solve this issue?
To begin with, many banks and financial institutions have deployed transaction monitoring systems. They also share blacklists of compromised cards, fraud "hot lists", and data breach information with other banks and merchants. From a legal and regulatory standpoint, several laws and regulations have been enacted. Educational campaigns have also been introduced to encourage people not to share card details, to be cautious of phishing emails, and to regularly review their bank statements.
But as data scientists, our focus is on analyzing the technical capabilities of credit card fraud detection systems - particularly from a performance point of view.
For decades now, credit card fraud detection methodologies have relied heavily on rule-based systems - essentially hard-coded, static if-else logic defined by human experts. These systems followed the industry's best practices, typically flagging transactions that:
Exceeded a predefined monetary threshold
Originated from high-risk countries
Occurred at unusual times (e.g., late at night)
Were initiated from previously unseen devices
While these rules were simple to implement and easy to interpret, they consistently underperformed in complex, fast-evolving fraud environments.
How badly have they underperformed?
According to a 2024 study comparing rule-based and machine learning approaches to fraud detection (Sule et al., 2024), a survey of 150 financial institutions revealed that traditional rule-based systems catch only about 65-70% of fraudulent transactions. This demonstrates that such systems are limited to detecting known or obvious patterns and lack the flexibility to adapt to constantly evolving attack strategies.
These systems suffer from three main drawbacks:
Lack of adaptability - The rules are reactive, not predictive. By the time a new rule is created, fraudsters have often changed tactics. Moreover, as the number of transactions grows into the millions per hour, maintaining, tuning, and updating these rules becomes increasingly impractical
High false-positive rates - Many legitimate users are falsely flagged and blocked, leading to frustration and loss of revenue, and eventually, loss of reputation
Inability to detect complex patterns - Fraud patterns often involve subtle feature interactions that rules cannot capture
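The static rule logic described above can be sketched in a few lines. Note that the thresholds, country codes, and rule ordering below are purely illustrative assumptions, not an actual production rule set:

```python
# Minimal sketch of a static, rule-based fraud flagger.
# All thresholds and lists here are hypothetical examples.
HIGH_RISK_COUNTRIES = {"XX", "YY"}   # hypothetical ISO country codes
AMOUNT_THRESHOLD = 1000.0            # hypothetical monetary cutoff (USD)

def flag_transaction(amount, country, hour, device_seen_before):
    """Return True if any hard-coded rule fires."""
    if amount > AMOUNT_THRESHOLD:        # exceeds a predefined monetary threshold
        return True
    if country in HIGH_RISK_COUNTRIES:   # originates from a high-risk country
        return True
    if hour < 5:                         # occurs at an unusual time (late at night)
        return True
    if not device_seen_before:           # initiated from a previously unseen device
        return True
    return False

print(flag_transaction(25.0, "US", 14, True))    # no rule fires
print(flag_transaction(2500.0, "US", 14, True))  # amount rule fires
```

The brittleness is visible immediately: a fraudster who keeps amounts under the threshold, transacts from a "safe" country during business hours, and reuses a compromised device evades every rule.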
Recognizing these limitations, we naturally turn to a modern solution that has shown great promise - machine learning and deep learning.
Machine learning, and in recent years, deep learning - offers a fundamentally different paradigm. Instead of relying on hardcoded rules, these models learn patterns from data - whether linear trends, non-linear interactions, or hidden correlations.
When applied to credit card fraud detection, this approach provides several significant advantages over traditional rule-based systems:
Pattern Recognition at scale:
Machine learning algorithms excel at identifying subtle and complex behaviors across vast datasets. This allows them to detect not only known fraud signals, but also previously unseen patterns that rule-based systems would completely miss
Real-time detection
With the help of fast, lightweight models and advanced streaming techniques, machine learning-based systems can flag suspicious activity in near real-time. This enables financial institutions to intervene before a fraudulent transaction is completed, thereby minimizing losses
Adaptive Learning
Unlike static rule engines, machine learning models can be continuously retrained on fresh data, allowing them to evolve alongside fraudster tactics. This makes them far more resilient against emerging threats such as synthetic identity fraud or bot-driven attacks
Reduced False Positives
One of the major limitations of rule-based systems is their high false-positive rate. By leveraging features like transaction time, location, amount, merchant category, device fingerprint and user behavior history, machine learning models can better distinguish between normal and suspicious activity - resulting in fewer legitimate transactions being blocked
In essence, machine learning and deep learning give fraud detection systems the ability to "think statistically", to learn from history, adapt to new trends, and respond to threats faster than humans can design new rules.
So... What's the catch?
While machine learning and deep learning offer powerful capabilities, applying them to credit card fraud detection is far from straightforward. These systems require large volumes of high-quality data, careful model design, and robust evaluation to be effective - and even then, several challenges persist. For instance:
Class Imbalance: In real-world scenarios, fraudulent transactions are relatively rare compared to the total volume of transactions. This imbalance can cause models to favor the majority class (non-fraud) and overlook the minority class (fraud), leading to poor recall - that is, actual fraud cases being missed
Data privacy and availability: Accessing real-world transaction data is difficult due to strict confidentiality and compliance standards (e.g., GDPR, PCI DSS). This limits open research and model generalizability
False Positives vs. Customer Experience: A highly sensitive model may flag too many legitimate transactions as fraud, causing customer frustration, support overload and reputational damage. Striking the right balance between precision and recall is a constant challenge
To explore these challenges in a controlled environment, we turn to a publicly available dataset from Kaggle, which simulates real-world credit card transaction patterns. This dataset contains thousands of labeled transactions - some genuine, others fraudulent - and serves as the foundation for our fraud detection experiments.
Although it does not capture the full complexity of enterprise-scale financial systems, it enables us to simulate class imbalance, apply and compare various machine learning models, and evaluate the trade-offs between detection accuracy and false positive rates. By succeeding in this endeavor, we can gradually refine these models, making them increasingly robust and accurate over time.
Motivation¶
Credit card fraud is more than a technical anomaly - it's a global problem that directly affects the financial security and emotional well-being of millions of people. Behind every fraudulent transaction is a victim, someone whose trust was violated, whose savings may have been compromised, and who now faces a difficult and often bureaucratic process to reclaim what was lost.
The motivation for this project stems from our desire to contribute, even in a small way, to a safer digital ecosystem. As data science students and future data professionals, we believe we have a responsibility to harness machine learning not only for innovation, but also for protection. The growing accessibility of fraud tools on the dark web, the rise of AI-generated phishing attacks, and the sheer scale of financial losses each year all point to a troubling trend - one that requires urgent and ongoing attention.
By exploring how data-driven techniques can help detect and mitigate fraudulent behavior, we hope to highlight the positive and ethical role data science can play in safeguarding individuals and institutions alike.
Project Overview¶
This project aims to build an intelligent system capable of detecting fraudulent credit card transactions using machine learning techniques. We will approach this problem by leveraging the Credit Card Transactions dataset provided on Kaggle, which contains over 1.85 million anonymized records of credit card transactions. Each record includes a fraud label indicating whether the transaction is fraudulent or not.
We will:
Analyze the distribution and characteristics of fraudulent vs. non-fraudulent transactions
Handle class imbalance, which is a core challenge in fraud detection
Train and compare multiple machine learning models, including neural networks for classification
Optimize the balance between recall (catching frauds) and precision (avoiding false alarms)
Evaluate models using real-world metrics such as recall, precision, F1-score, AUC-ROC
Later, we will summarize our findings, highlighting which models performed best under what conditions and discuss the trade-offs encountered during model tuning.
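As a sketch of how these evaluation metrics are computed with scikit-learn (the labels and scores below are made up purely for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and predicted fraud probabilities
y_true  = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.1, 0.2, 0.8, 0.05, 0.9, 0.4, 0.3, 0.7, 0.15, 0.25]

# Threshold the probabilities at 0.5 to obtain hard predictions
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")
print(f"AUC-ROC:   {roc_auc_score(y_true, y_score):.2f}")  # uses scores, not hard labels
```

Note that AUC-ROC is computed from the raw scores rather than the thresholded predictions, which is why it can remain high even when a poorly chosen threshold hurts precision or recall.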
Information about the dataset:
You can access the dataset here:
👉 Kaggle - Credit Card Transactions
Each record in the dataset represents a single transaction and includes the following features:
trans_date_trans_time - Exact date and time of the transaction
cc_num - Credit card number used
merchant - Merchant or vendor where the purchase took place
category - Transaction category (e.g., groceries, entertainment, etc.)
amt - Monetary amount of the transaction (USD)
first - Cardholder's first name
last - Cardholder's last name
gender - Gender of the cardholder
street - Street address of the cardholder
city - City of the cardholder
state - U.S. state (2-letter abbreviation)
zip - ZIP code of the cardholder's address
lat - Latitude coordinate of the cardholder's location
long - Longitude coordinate of the cardholder's location
city_pop - Population of the cardholder's city
job - Cardholder's occupation
dob - Date of birth of the cardholder
trans_num - Unique transaction identifier
unix_time - Transaction timestamp in UNIX format
merch_lat - Latitude of the merchant's location
merch_long - Longitude of the merchant's location
is_fraud - Target label (0 = genuine, 1 = fraudulent)
Unnamed: 0 - Index column created during export (to be dropped)
Exploratory Data Analysis (EDA)¶
In this stage we will:
Analyze the dataset to understand its structure and feature distributions
Identify potential anomalies, outliers, or data quality issues
Use visualizations to uncover trends and relationships between features
Establish initial insights that will inform the preprocessing and modeling stages
Mount Google Drive¶
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Libraries¶
# Data Manipulation and Numerical Analysis
import pandas as pd
import numpy as np
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Scikit-learn
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_validate
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
RandomForestClassifier,
HistGradientBoostingClassifier
)
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score,
average_precision_score
)
from sklearn.utils.class_weight import compute_class_weight
from sklearn.base import BaseEstimator, TransformerMixin
# Imbalanced Data Handling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from statsmodels.stats.proportion import proportions_ztest
from sklearn.decomposition import PCA
import re
# Geographical
import folium
from folium.plugins import MarkerCluster
# Clustering
from sklearn.manifold import TSNE
import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Temporal
import matplotlib.ticker as mtick
# Training Models
import torch
import random
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.metrics import make_scorer
from IPython.display import clear_output
from xgboost import XGBClassifier
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.base import ClassifierMixin
from sklearn.model_selection import train_test_split
Importing the dataset¶
df_train = pd.read_csv('/content/drive/MyDrive/kaggle/סדנה במדעי הנתונים - עומר ומקס/fraudTrain.csv')
df_test = pd.read_csv('/content/drive/MyDrive/kaggle/סדנה במדעי הנתונים - עומר ומקס/fraudTest.csv')
Structure of the dataset¶
df_train.head()
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.00 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |
5 rows × 23 columns
df_test.head()
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-06-21 12:14:25 | 2291163933867244 | fraud_Kirlin and Sons | personal_care | 2.86 | Jeff | Elliott | M | 351 Darlene Green | ... | 33.9659 | -80.9355 | 333497 | Mechanical engineer | 1968-03-19 | 2da90c7d74bd46a0caf3777415b3ebd3 | 1371816865 | 33.986391 | -81.200714 | 0 |
| 1 | 1 | 2020-06-21 12:14:33 | 3573030041201292 | fraud_Sporer-Keebler | personal_care | 29.84 | Joanne | Williams | F | 3638 Marsh Union | ... | 40.3207 | -110.4360 | 302 | Sales professional, IT | 1990-01-17 | 324cc204407e99f51b0d6ca0055005e7 | 1371816873 | 39.450498 | -109.960431 | 0 |
| 2 | 2 | 2020-06-21 12:14:53 | 3598215285024754 | fraud_Swaniawski, Nitzsche and Welch | health_fitness | 41.28 | Ashley | Lopez | F | 9333 Valentine Point | ... | 40.6729 | -73.5365 | 34496 | Librarian, public | 1970-10-21 | c81755dbbbea9d5c77f094348a7579be | 1371816893 | 40.495810 | -74.196111 | 0 |
| 3 | 3 | 2020-06-21 12:15:15 | 3591919803438423 | fraud_Haley Group | misc_pos | 60.05 | Brian | Williams | M | 32941 Krystal Mill Apt. 552 | ... | 28.5697 | -80.8191 | 54767 | Set designer | 1987-07-25 | 2159175b9efe66dc301f149d3d5abf8c | 1371816915 | 28.812398 | -80.883061 | 0 |
| 4 | 4 | 2020-06-21 12:15:17 | 3526826139003047 | fraud_Johnston-Casper | travel | 3.19 | Nathan | Massey | M | 5783 Evan Roads Apt. 465 | ... | 44.2529 | -85.0170 | 1126 | Furniture designer | 1955-07-06 | 57ff021bd3f328f8738bb535c302a31b | 1371816917 | 44.959148 | -85.884734 | 0 |
5 rows × 23 columns
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1296675 entries, 0 to 1296674 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 1296675 non-null int64 1 trans_date_trans_time 1296675 non-null object 2 cc_num 1296675 non-null int64 3 merchant 1296675 non-null object 4 category 1296675 non-null object 5 amt 1296675 non-null float64 6 first 1296675 non-null object 7 last 1296675 non-null object 8 gender 1296675 non-null object 9 street 1296675 non-null object 10 city 1296675 non-null object 11 state 1296675 non-null object 12 zip 1296675 non-null int64 13 lat 1296675 non-null float64 14 long 1296675 non-null float64 15 city_pop 1296675 non-null int64 16 job 1296675 non-null object 17 dob 1296675 non-null object 18 trans_num 1296675 non-null object 19 unix_time 1296675 non-null int64 20 merch_lat 1296675 non-null float64 21 merch_long 1296675 non-null float64 22 is_fraud 1296675 non-null int64 dtypes: float64(5), int64(6), object(12) memory usage: 227.5+ MB
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 555719 entries, 0 to 555718 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 555719 non-null int64 1 trans_date_trans_time 555719 non-null object 2 cc_num 555719 non-null int64 3 merchant 555719 non-null object 4 category 555719 non-null object 5 amt 555719 non-null float64 6 first 555719 non-null object 7 last 555719 non-null object 8 gender 555719 non-null object 9 street 555719 non-null object 10 city 555719 non-null object 11 state 555719 non-null object 12 zip 555719 non-null int64 13 lat 555719 non-null float64 14 long 555719 non-null float64 15 city_pop 555719 non-null int64 16 job 555719 non-null object 17 dob 555719 non-null object 18 trans_num 555719 non-null object 19 unix_time 555719 non-null int64 20 merch_lat 555719 non-null float64 21 merch_long 555719 non-null float64 22 is_fraud 555719 non-null int64 dtypes: float64(5), int64(6), object(12) memory usage: 97.5+ MB
First impression:
After inspecting the dataset, several initial insights can be drawn:
Both training and test sets share a consistent schema - 23 columns, identical names, and matching data types
There are no missing values across any columns, which simplifies preprocessing and ensures data integrity
The dataset combines spatial, temporal and demographic information, including customer and merchant coordinates, timestamps, job titles and transaction amounts - an excellent basis for behavioral and anomaly-based fraud detection
Several columns such as merchant, city, job, and category are categorical (object dtype) and will require proper encoding to prevent overfitting and handle high cardinality
The is_fraud column is included in both datasets, providing clear binary labels for supervised learning
Overall, the dataset is well-structured, comprehensive, and realistic, which provides a strong foundation for the modeling phase.
Dataset Integrity and Scale Verification
Before moving further into feature-level exploration, we will validate the internal scale and realism of the dataset to ensure consistency with the reported simulation design:
Full Dataset:
df_full = pd.concat([df_train, df_test], axis=0)
unique_counts = df_full.nunique()
print(unique_counts)
Unnamed: 0 1296675 trans_date_trans_time 1819551 cc_num 999 merchant 693 category 14 amt 60616 first 355 last 486 gender 2 street 999 city 906 state 51 zip 985 lat 983 long 983 city_pop 891 job 497 dob 984 trans_num 1852394 unix_time 1819583 merch_lat 1754157 merch_long 1809753 is_fraud 2 dtype: int64
Train set:
print(f"Total transactions (train): {len(df_train):,}")
print(f"Unique cardholders (cc_num): {df_train['cc_num'].nunique():,}")
print(f"Unique merchants: {df_train['merchant'].nunique():,}")
print(f"Unique categories: {df_train['category'].nunique():,}")
print(f"Transaction date range: {df_train['trans_date_trans_time'].min()} -> {df_train['trans_date_trans_time'].max()}")
Total transactions (train): 1,296,675 Unique cardholders (cc_num): 983 Unique merchants: 693 Unique categories: 14 Transaction date range: 2019-01-01 00:00:18 -> 2020-06-21 12:13:37
print(f"Merchants: {df_train['merchant'].head(5)}")
print(f"Categories: {df_train['category'].unique()}")
Merchants: 0 fraud_Rippin, Kub and Mann 1 fraud_Heller, Gutmann and Zieme 2 fraud_Lind-Buckridge 3 fraud_Kutch, Hermiston and Farrell 4 fraud_Keeling-Crist Name: merchant, dtype: object Categories: ['misc_net' 'grocery_pos' 'entertainment' 'gas_transport' 'misc_pos' 'grocery_net' 'shopping_net' 'shopping_pos' 'food_dining' 'personal_care' 'health_fitness' 'travel' 'kids_pets' 'home']
Test set:
print(f"Total transactions (test): {len(df_test):,}")
print(f"Unique cardholders (cc_num): {df_test['cc_num'].nunique():,}")
print(f"Unique merchants: {df_test['merchant'].nunique():,}")
print(f"Unique categories: {df_test['category'].nunique():,}")
print(f"Transaction date range: {df_test['trans_date_trans_time'].min()} -> {df_test['trans_date_trans_time'].max()}")
Total transactions (test): 555,719 Unique cardholders (cc_num): 924 Unique merchants: 693 Unique categories: 14 Transaction date range: 2020-06-21 12:14:25 -> 2020-12-31 23:59:34
print(f"Merchants: {df_test['merchant'].head(5)}")
print(f"Categories: {df_test['category'].unique()}")
Merchants: 0 fraud_Kirlin and Sons 1 fraud_Sporer-Keebler 2 fraud_Swaniawski, Nitzsche and Welch 3 fraud_Haley Group 4 fraud_Johnston-Casper Name: merchant, dtype: object Categories: ['personal_care' 'health_fitness' 'misc_pos' 'travel' 'kids_pets' 'shopping_pos' 'food_dining' 'home' 'entertainment' 'shopping_net' 'misc_net' 'grocery_pos' 'gas_transport' 'grocery_net']
Explanation:
As mentioned earlier, the dataset contains over 1.85 million transactions, separated into two sets: the training set, with approximately 1.3 million transactions (≈70% of the total data), and the test set, with approximately 555k transactions (≈30%).
Together, they represent activity generated across 999 unique credit cards and 693 merchants, aligning closely with the reported simulation parameters of roughly 1,000 customers and 800 merchants (mentioned in Kaggle). These transactions are distributed among 14 merchant categories, covering everyday spending areas such as grocery, fuel, dining, shopping and travel - providing a balanced and realistic view of consumer behavior.
The temporal coverage extends from January 2019 to December 2020, with the training data spanning January 2019 to June 2020, and the test data continuing from June 2020 to December 2020. This continuous timeline captures approximately 2 years of transactional activity, sufficient to observe seasonal effects, behavioral variations and potential drift over time.
These findings confirm that the dataset's internal structure and temporal design are coherent, consistent and credible, faithfully representing the intended simulation logic rather than arbitrary synthetic data. While its overall size is modest compared to real-world credit card systems (which handle thousands of transactions per second), it is behaviorally representative, making it highly suitable for developing and validating fraud-detection models focused on transaction-level behavioral patterns rather than large-scale throughput.
Feature Cleanup and Selection
Before diving into deeper exploration, we perform some initial feature cleanup to reduce redundancy and remove non-informative columns
Columns to drop
Unnamed: 0 - Index column generated during CSV export (not useful for modeling)
first and last - Names are non-predictive and irrelevant to fraud behavior
trans_num - Unique transaction identifier; does not contribute to predictive patterns
drop = ['Unnamed: 0', 'first', 'last', 'trans_num']
for i in drop:
df_train.drop(columns=i, inplace=True)
df_test.drop(columns=i, inplace=True)
Handling Redundant Temporal Columns
Both trans_date_trans_time and unix_time encode the transaction timestamp. Since they represent the same information, we can safely drop one to avoid redundancy. We retain trans_date_trans_time because it offers direct interpretability and allows for the extraction of meaningful temporal features such as:
Hour of the day (to capture time-of-day spending patterns)
Day of the week (to identify weekday vs. weekend behaviors)
Month and year (for seasonal and long-term trend analysis)
In contrast, unix_time represents the same information as a continuous integer timestamp (seconds since the Unix epoch), which lacks immediate interpretability and cannot directly provide calendar-based insights without conversion.
df_train[['unix_time', 'trans_date_trans_time']].head(5)
| unix_time | trans_date_trans_time | |
|---|---|---|
| 0 | 1325376018 | 2019-01-01 00:00:18 |
| 1 | 1325376044 | 2019-01-01 00:00:44 |
| 2 | 1325376051 | 2019-01-01 00:00:51 |
| 3 | 1325376076 | 2019-01-01 00:01:16 |
| 4 | 1325376186 | 2019-01-01 00:03:06 |
df_train = df_train.drop(columns=['unix_time'], errors='ignore')
df_test = df_test.drop(columns=['unix_time'], errors='ignore')
Duplicate Check
print(f"Number of duplicates rows (training set): {df_train.duplicated().sum()}")
print(f"Number of duplicates rows (test set): {df_test.duplicated().sum()}")
Number of duplicates rows (training set): 0 Number of duplicates rows (test set): 0
Overview Summary
Schema Match: Train and test sets align perfectly
Labels: The is_fraud label is binary and complete
Data Quality: No missing values; data types are consistent
Volume split: ~70% train / 30% test - a standard and reasonable proportion.
Now that the dataset's integrity and structure have been verified, we can confidently proceed to exploratory visualization and feature-level analysis, where each variable will be examined both numerically and visually to uncover patterns, outliers, and potential predictive signals indicative of fraudulent behavior
Features¶
🕘 trans_date_trans_time¶
Data Integrity¶
The trans_date_trans_time feature shows complete and consistent values with no missing or anomalous entries.
Duplicate timestamps are expected since multiple transactions can occur at the same moment, and all recorded timestamps fall within the valid range of January 2019 - December 2020.
To facilitate time-based analysis, we derived the following temporal features from trans_date_trans_time feature:
# Apply datetime conversion and feature extraction for both datasets
for df in [df_train, df_test]:
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['hour'] = df['trans_date_trans_time'].dt.hour
df['day_of_week'] = df['trans_date_trans_time'].dt.day_name()
df['month'] = df['trans_date_trans_time'].dt.month
df['year'] = df['trans_date_trans_time'].dt.year
df['date'] = df['trans_date_trans_time'].dt.date
for name, df in [('df_train', df_train), ('df_test', df_test)]:
print(f"\nNumber of unique values in temporal features of {name}:")
for col in ['hour', 'day_of_week', 'month', 'year']:
print(f"{col} unique values: {df[col].nunique()}")
Number of unique values in temporal features of df_train: hour unique values: 24 day_of_week unique values: 7 month unique values: 12 year unique values: 2 Number of unique values in temporal features of df_test: hour unique values: 24 day_of_week unique values: 7 month unique values: 7 year unique values: 1
These results confirm that all temporal components were correctly extracted. We also examined df_test, which is something we normally avoid to prevent data leakage. However, in this case, the inspection focused solely on basic structural integrity, not on any patterns or distributions related to the target variable (is_fraud).
This validation step was necessary, because we added engineered temporal features to df_test, ensuring they are properly structured for models evaluation, and remain consistent with the feature set used to train the models on df_train. These features will replace the direct reliance on the original trans_date_trans_time column.
df_train = df_train.drop(columns=['trans_date_trans_time'], errors='ignore')
df_test = df_test.drop(columns=['trans_date_trans_time'], errors='ignore')
Overall, the trans_date_trans_time feature and its derived temporal components demonstrate excellent data integrity. We are now ready to explore the derived features and analyze interesting patterns in the data:
Transaction Volume over Time¶
daily_txn = df_train.groupby('date').size()
plt.figure(figsize=(12,6))
daily_txn.plot(kind='line', lw=1.5)
plt.title("Daily Transaction Volume Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Transactions")
plt.grid(True, alpha=0.3)
plt.show()
Graph 1 - Daily Transaction Volume Over Time
The figure above illustrates the daily transaction counts between January 2019 and June 2020.
A consistent weekly cyclic pattern is visible, reflecting regular consumer activity. Transaction volumes rise gradually through 2019, stabilizing around 3,500 - 4,000 transactions per day, before spiking sharply to ≈ 6,000 transactions/day in late 2019. At the start of 2020, the volume drops to around 2,500 - 3,000 transactions/day, where it remains steady throughout the following months.
These structural shifts could be due to changes in simulated data generation, seasonal shopping cycles, or economic variations represented in the synthetic dataset. The strong periodic peaks and troughs likely correspond to weekly purchasing rhythms, which will be explored further in the weekday and hourly analyses that follow
💡 Interpretation Note
Interestingly, the sharp decline in early 2020 coincides with the real-world emergence of COVID-19, which may have been implicitly reflected in the simulator's generation logic. Even if unintentional, the pattern aligns with actual global spending slowdowns, providing a plausible, interpretable shift within the dataset's temporal structure.
This correspondence may help explain the sharp decline in transaction volume observed at the beginning of 2020
Transactions per Hour¶
hourly_txn = (
df_train.groupby('hour')
.size()
.reset_index(name="count")
)
sns.barplot(data=hourly_txn, x='hour', y='count')
plt.title("Transactions per Hour of Day")
plt.xlabel("Hour of Day (0-23)")
plt.ylabel("Number of transactions")
plt.show()
Graph 2 - Transactions per Hour of Day
The distribution of transactions across hours of the day reveals two distinct activity zones:
Hours 0-11 (midnight to late morning): transaction volumes remain relatively stable at around ~42K transactions per hour
Hours 12-23 (afternoon to midnight): A pronounced surge occurs, with volumes increasing sharply to around ~65K transactions per hour
This pattern indicates that most transactional activity occurs in the second half of the day, after noon.
The rise after 12:00 corresponds to typical consumer behavior, with increased purchasing during lunch breaks, afternoon shopping, and evening leisure or online spending.
For fraud detection, the hour-of-day feature is likely highly informative. Transaction density is far from uniform throughout the day, meaning unusual timing (e.g., very late night activity) may signal suspicious behavior.
💡 Interpretation Note
Interestingly, the sharp transition around midday could also stem from batch-based data generation or transaction posting delays, which are common in financial systems that process transactions in grouped cycles.
In a real-world context, this midday spike could reflect the combined effect of time zone overlaps (e.g., East Coast and West Coast transaction synchronization) or increased digital activity as users engage more with online platforms after work hours.
These hourly variations may later help us design more advanced temporal features (like is_night or is_weekend) that encode typical behavioral rhythms into our model.
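As a quick sketch of that idea, the helper below derives both flags from the hour and day_of_week columns used above. The 22:00-03:59 night window is an assumption to tune later, and the tiny frame is illustrative, not the real dataset:

```python
import pandas as pd

def add_time_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple behavioral time flags from 'hour' and 'day_of_week'."""
    out = df.copy()
    # Night window 22:00-03:59 is an assumed choice, not dictated by the data
    out['is_night'] = out['hour'].isin([22, 23, 0, 1, 2, 3]).astype(int)
    out['is_weekend'] = out['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
    return out

# Tiny illustrative frame (not the real dataset)
demo = pd.DataFrame({'hour': [23, 14, 2],
                     'day_of_week': ['Saturday', 'Monday', 'Friday']})
flags = add_time_flags(demo)
print(flags[['is_night', 'is_weekend']].values.tolist())  # [[1, 1], [0, 0], [1, 0]]
```

On the full training frame the same call would add both columns in one pass, keeping the feature logic in a single reusable function.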
Transactions per Weekday¶
# Weekday analysis: average number of daily transactions per weekday
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_by_day = (
    df_train.groupby(['date', 'day_of_week'])
            .size()
            .groupby('day_of_week')
            .mean()
            .reindex(order)
            .rename('avg_txn')
)
sns.barplot(
    data=avg_by_day.reset_index(),
    x='day_of_week',
    y='avg_txn',
    order=order
)
plt.title("Average Transactions per Weekday")
plt.ylabel("Average Daily Transactions")
plt.xlabel("")
plt.show()
Graph 3 - Average Transactions per Weekday
The bar chart reveals a clear weekly seasonality in transaction activity. This analysis is based on the average number of daily transactions, since weekdays occur unevenly throughout the dataset (some months contain more Mondays or Fridays than others). Transaction volumes peak on Mondays and Sundays, suggesting higher consumer spending at the start and end of each week.
Mid-week days (Wednesday through Friday) show a noticeable decline in activity, while Saturday sits in an intermediate range - higher than most weekdays but below the two peak days.
This pattern reflects typical consumer behavior: spending often increases during weekends and early in the week when individuals complete online purchases or handle routine payments after the weekend.
Transactions per Month¶
monthly_txn = (
df_train.groupby('month')
.size()
.reset_index(name="count")
)
sns.barplot(data=monthly_txn, x='month', y='count')
plt.title("Transactions per Month")
plt.xlabel("Months (1-12)")
plt.ylabel("Number of transactions")
plt.show()
Graph 4 - Transactions per Month
The bar chart illustrates the distribution of transactions across the twelve months of the year. A clear seasonal trend is visible, showing fluctuations in consumer activity throughout the year.
Transaction volumes gradually increase from January, reaching their highest levels during April to June, with May standing out as the peak month of activity. Following this mid-year high, there is a noticeable decline from July to October, indicating a period of reduced consumer spending.
Toward the end of the year, transaction counts rise again in December, reflecting a holiday-related surge in purchases, a common seasonal pattern in financial transaction data.
This monthly trend suggests that the dataset captures realistic seasonal consumer behavior, where higher transaction volumes correspond to known spending periods, such as spring and end-of-year shopping cycles.
Fraudulent transactions¶
Let us now analyze the rate of fraudulent transactions, based on the months, days and hours:
fig, axes = plt.subplots(3, 1, figsize=(12, 18))
# --- Graph 1: Fraud count by month ---
fraud_by_month = df_train.groupby('month')['is_fraud'].sum()
axes[0].plot(fraud_by_month.index, fraud_by_month.values, marker='o')
axes[0].set_title("Fraud Count by Month")
axes[0].set_ylabel("Fraud Count")
axes[0].set_xlabel("")
# --- Graph 2: Fraud count by weekday ---
order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fraud_by_day = df_train.groupby('day_of_week')['is_fraud'].sum().reindex(order)
sns.barplot(x=fraud_by_day.index, y=fraud_by_day.values, ax=axes[1], order=order)
axes[1].set_title("Fraud Count by Weekday")
axes[1].set_ylabel("Fraud Count")
axes[1].set_xlabel("")
# --- Graph 3: Fraud count by hour ---
fraud_by_hour = df_train.groupby('hour')['is_fraud'].sum()
sns.barplot(x=fraud_by_hour.index, y=fraud_by_hour.values, ax=axes[2])
axes[2].set_title("Fraud Count by Hour of Day")
axes[2].set_ylabel("Fraud Count")
axes[2].set_xlabel("Hour (0–23)")
plt.tight_layout()
plt.show()
Graph 5 - Fraudulent transactions by Month, Weekday and Hour
The temporal distribution of fraudulent transactions reveals several behavioral patterns that align closely with normal consumer activity - a likely attempt by fraudsters to blend in and avoid detection.
Monthly: Fraud rates vary noticeably throughout the year. Activity peaks around March - May, then drops sharply through July-October, before rising slightly again in December. This coincides with the highest overall transaction volumes in Graph 4, which suggests that fraudsters may deliberately exploit periods of intense consumer activity, when their actions can be more easily concealed among a large number of legitimate transactions.
Weekly: Fraud occurs across all days of the week but is most frequent on weekends (Saturday-Sunday) and Mondays. This mirrors the pattern of regular transaction activity, reinforcing the idea that attackers intentionally target high-traffic periods when monitoring may be less strict or slower to respond.
Hourly: Fraud is concentrated late at night, especially between 21:00 and 03:00, far outside normal consumer behavior peaks.
These patterns indicate that fraudsters strategically time their actions to mimic legitimate behavior - taking advantage of busy transaction periods and low-monitoring hours. Consequently, time-based features such as hour, day_of_week and month provide strong predictive signals for fraud detection models, helping to distinguish genuine consumer activity from subtle fraudulent behavior.
Overall, the trans_date_trans_time feature yielded useful temporal components that demonstrate excellent data integrity and reveal clear, realistic behavioral patterns in transaction activity. These patterns, for both legitimate and fraudulent transactions, confirm that time-based behavior plays a key role in distinguishing between normal and suspicious activity.
Having established clear temporal behavior patterns, we now proceed to analyze card-level activity (cc_num) to explore how transaction frequency and fraud concentration vary across cardholders.
💳 cc_num¶
Data Integrity¶
In the previous sections, we confirmed that the training set contains no missing values, including the cc_num feature, and identified 983 unique cardholders - consistent with the dataset's design.
At this point, we verify whether there are any unrealistic duplicate transactions, where the same card number, timestamp, and transaction amount appear together - a potential indicator of synthetic duplication or data leakage.
dup_card_rows = df_train.duplicated(subset=['cc_num', 'year', 'month', 'day_of_week', 'hour', 'amt']).sum()
print(f"Duplicate transactions (same card, time, and amount): {dup_card_rows}")
Duplicate transactions (same card, time, and amount): 93
We can see that there are 93 identified duplicate records. While this is not a large number (less than 0.01% of all transactions), it is worth checking whether these transactions are anomalous or standard, realistic transactions:
dup_cards = (
df_train[df_train.duplicated(subset=['cc_num', 'year', 'month', 'day_of_week', 'hour', 'amt'], keep=False)]
.groupby('cc_num')
.size()
.reset_index(name="duplicate_count")
.sort_values('duplicate_count', ascending=False)
)
dup_cards.head(10)
| | cc_num | duplicate_count |
|---|---|---|
| 1 | 571365235126 | 4 |
| 5 | 4464457352619 | 4 |
| 6 | 4585132874641 | 4 |
| 8 | 30270432095985 | 4 |
| 11 | 30561214688470 | 4 |
| 40 | 3531129874770000 | 4 |
| 79 | 4536996888716062123 | 4 |
| 82 | 4956828990005111019 | 4 |
| 45 | 3553629419254918 | 4 |
| 23 | 341546199006537 | 4 |
We observe that the top 10 credit cards each exhibit four repeated transaction patterns. What stands out is that this duplication is highly structured rather than random - every card repeats exactly four times, a level of uniformity that seems too precise to occur by chance. In a noisy or corrupted dataset, we would expect varying repetition counts across cards. Therefore, this pattern likely reflects an intentional design or simulation effect rather than accidental duplication.
One possible explanation could be recurring legitimate payments, where cardholders repeatedly pay the same bill or subscription under similar conditions. However, this seems improbable, because the repetition is too consistent and limited in scope. If these were genuine recurring payments, we would expect a wider distribution of repetition frequencies and more extensive recurrence over time.
Let us now observe the fraud rate among these duplicates:
df_train[df_train['cc_num'].isin(dup_cards['cc_num'])]['is_fraud'].value_counts(normalize=True)
| is_fraud | proportion |
|---|---|
| 0 | 0.9962 |
| 1 | 0.0038 |
The fraud rate among the duplicated transactions is extremely low, indicating that these repetitions are almost entirely non-fraudulent. This suggests that the duplicate patterns are not the result of malicious activity, but rather a byproduct of the simulation process or recurring legitimate-like behavior within the synthetic data.
In other words, while the duplication pattern is unusually structured, it does not correspond to elevated fraud risk and therefore does not compromise data integrity. Instead, it provides a minor but realistic layer of transaction redundancy, consistent with real-world payment systems where repeated or batched transactions occasionally occur. Therefore, we won't drop or modify the duplicates, and will leave them as they are.
Next, it is critical to check whether there is train-test overlap, which can lead to data leakage.
If the same credit card appears in both the training and test sets, a model might memorize a card's historical behavior instead of learning generalizable fraud patterns.
train_cards = set(df_train['cc_num'].unique())
test_cards = set(df_test['cc_num'].unique())
overlap = len(train_cards & test_cards)
print(f"Cards appearing in both train and test: {overlap} / {len(test_cards)} ({overlap/len(test_cards):.2%})")
Cards appearing in both train and test: 908 / 924 (98.27%)
We observe that 908 out of 924 cards (≈ 98.3%) appear in both the training and test datasets.
This indicates that the dataset was split by transaction rather than by cardholder - meaning the same credit card can appear in both sets.
While this design is valid for transactional modeling, it introduces a potential information leakage risk. A model might memorize individual card behavior instead of learning general fraud patterns.
To mitigate this, during modeling we should consider:
- Using grouped cross-validation by cc_num, ensuring that all transactions of a given card remain in the same fold
- Evaluating the model both with and without card-level features to assess generalization capability
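A minimal sketch of the grouped-split idea, using scikit-learn's GroupKFold on a small synthetic frame (the toy data below is illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for df_train: 100 transactions spread across 10 cards
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'cc_num': rng.integers(0, 10, size=100),
    'amt': rng.random(100).round(2),
    'is_fraud': rng.integers(0, 2, size=100),
})

# GroupKFold keeps every transaction of a given card in a single fold,
# so no card's history leaks from training into validation
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(df, groups=df['cc_num']):
    train_cards = set(df.loc[train_idx, 'cc_num'])
    val_cards = set(df.loc[val_idx, 'cc_num'])
    assert train_cards.isdisjoint(val_cards)  # no card overlap by construction
```

In practice the same `groups=df_train['cc_num']` argument can be passed to `cross_val_score` or `GridSearchCV` to keep every evaluation card-disjoint.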
💡 Important Remark:
Although the overlap raises leakage concerns, it also opens the door for meaningful feature engineering. In real-world banking systems, institutions track and share historical information about cardholders. Inspired by this, we could later engineer a feature such as history_of_fraud - indicating whether a card has previously been involved in fraudulent activity.
Such a feature would emulate real fraud prevention mechanisms, where past behavior informs current risk, allowing the model to better identify high-risk cards while maintaining realistic, and ethical modeling practices.
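One hedged way to build such a history_of_fraud flag without peeking into the future is a shifted cumulative maximum per card, so each row only sees frauds that happened strictly before it (the five-row frame is illustrative):

```python
import pandas as pd

# Illustrative transactions in time order for two cards
txns = pd.DataFrame({
    'cc_num': [1, 1, 1, 2, 2],
    'is_fraud': [0, 1, 0, 0, 0],
})

# For each transaction: has this card had any *prior* fraud?
# shift(1) excludes the current row, so the flag uses only past information
txns['history_of_fraud'] = (
    txns.groupby('cc_num')['is_fraud']
        .transform(lambda s: s.shift(1, fill_value=0).cummax())
)
print(txns['history_of_fraud'].tolist())  # [0, 0, 1, 0, 0]
```

The shift-before-cummax ordering is what keeps the feature leakage-free: the fraudulent second transaction of card 1 only flags the third one, never itself.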
Transaction Frequency per Card¶
Although card numbers serve primarily as identifiers, their activity levels can reveal behavioral patterns.
Here, we examine how many transactions each card performs:
df_train['cc_num'].value_counts().describe()
| statistic | value |
|---|---|
| count | 983.000000 |
| mean | 1319.099695 |
| std | 812.235900 |
| min | 7.000000 |
| 25% | 525.000000 |
| 50% | 1054.000000 |
| 75% | 2025.000000 |
| max | 3123.000000 |
The transaction distribution per card is highly uneven:
- Minimum: 7 transactions
- Maximum: 3,123 transactions
- Median: 1,054 transactions
- Mean: ~1319
- Standard deviation: 812
This indicates that a small subset of high-activity cards contributes disproportionately to the total transaction volume - a common phenomenon in real financial datasets where some customers transact more frequently (e.g., business accounts, recurring payments).
For modeling, this implies that cc_num may introduce bias or overfitting if the model memorizes specific card patterns rather than learning generalized fraud behaviors
Outlier Exploration¶
Since cc_num is categorical (an ID), it can't have numeric outliers - but its behavioral characteristics can.
We therefore define "outliers" as cards that display:
Unusually high or low transaction counts (activity outliers)
Atypical fraud ratios compared to the general population
Abnormal mean transaction amounts
We identify behavioral outliers using the IQR method applied to transaction counts per card:
txn_per_card = df_train.groupby('cc_num')['cc_num'].count().rename('txn_count')
Q1 = txn_per_card.quantile(0.25)
Q3 = txn_per_card.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_cards = txn_per_card[(txn_per_card < lower_bound) | (txn_per_card > upper_bound)]
print(f"Number of outlier cards: {len(outlier_cards)}")
Number of outlier cards: 0
plt.figure(figsize=(10,5))
sns.histplot(txn_per_card, bins=50, kde=True)
plt.axvline(upper_bound, color='red', linestyle='--', label='Upper Outlier Threshold')
plt.axvline(lower_bound, color='red', linestyle='--')
plt.title("Distribution of Transactions per Card - Identifying Outliers")
plt.xlabel("Number of Transactions per Card")
plt.ylabel("Count of Cards")
plt.legend()
plt.show()
Graph 6 - Distribution of Transactions per Card (Outlier Detection)
The histogram above displays the number of transactions made by each credit card (cc_num). The red dashed lines represent the calculated lower and upper statistical thresholds (1.5 x IQR rule).
The distribution follows a multi-modal and right-skewed pattern, showing distinct clusters of cards around specific transaction ranges (roughly 500, 1000, 1500, 2000 and 3000). This clustering is typical for simulated transactional datasets, where user behavior is generated across several predefined activity levels - for example, light, moderate, and heavy spenders.
Notably, no cards fall beyond the upper outlier threshold, meaning all cardholders exhibit realistic transaction volumes. Even the most active cards (≈3000 transactions) remain within expected behavioral limits.
The absence of statistical outliers confirms that the cc_num feature demonstrates consistent and credible transaction patterns, without evidence of synthetic bias or extreme anomalies.
The next section further examines how fraudulent activity is distributed across these cards, helping us understand whether certain cardholders are more exposed than others.
Fraud Distribution and Concentration¶
Let us assess whether fraudulent activity is concentrated within a few cards or spread broadly across all cardholders:
fraud_counts_per_card = df_train.groupby('cc_num')['is_fraud'].sum()
plt.figure(figsize=(10,5))
sns.histplot(fraud_counts_per_card, bins=50, log=True)
plt.title("Number of cards per count of fraudulent transactions")
plt.xlabel("Number of fraudulent transactions")
plt.ylabel("Number of cards")
plt.show()
Graph 7 - Number of Cards per count of fraudulent transactions
The histogram shows that most cards experience relatively few fraudulent transactions, typically between 5 and 15 per card, while only a handful of cards show significantly higher fraud counts.
This pattern suggests that fraud is widespread but low-intensity, resembling random attack behavior rather than repeated targeting of specific accounts.
Fraud Exposure Rate
Finally, we check how many cards experienced at least one fraudulent transaction:
fraud_cards = df_train[df_train['is_fraud'] == 1]['cc_num'].nunique()
total_cards = df_train['cc_num'].nunique()
print(f"{fraud_cards}/{total_cards} cards (~{fraud_cards/total_cards:.2%}) had at least one fraud")
762/983 cards (~77.52%) had at least one fraud
💡 Interpretation:
Around 77.5% of all cards (762 out of 983) experienced at least one fraud event. This confirms that fraud is not limited to a small subset of users, but rather distributed across the dataset - consistent with the simulator's goal to represent broad, population-level fraud exposure.
As a result, transaction-level behavioral features (such as time, amount, and merchant category) will be more effective than card identifiers themselves for detecting fraud patterns.
The cc_num feature is clean, internally consistent and behaviorally informative. It forms a solid basis for aggregation-based features (e.g., fraud rate per card, average transaction frequency), though care must be taken to prevent overfitting in models that might memorize specific card behaviors.
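A hedged sketch of such aggregation-based features, computed per card and joined back onto the transactions (statistics like these should be computed on the training split only, to avoid leaking test-set information; the toy frame is illustrative):

```python
import pandas as pd

# Illustrative stand-in for df_train
demo = pd.DataFrame({
    'cc_num': [1, 1, 1, 2, 2],
    'amt': [10.0, 20.0, 30.0, 5.0, 15.0],
    'is_fraud': [0, 1, 0, 0, 0],
})

# Per-card behavioral aggregates: activity level, spend profile, fraud rate
card_stats = demo.groupby('cc_num').agg(
    card_txn_count=('amt', 'size'),
    card_avg_amt=('amt', 'mean'),
    card_fraud_rate=('is_fraud', 'mean'),
)

# Join the aggregates back as transaction-level features
features = demo.merge(card_stats.reset_index(), on='cc_num', how='left')
print(features.columns.tolist())
```

On the real data the same pattern applies with `df_train` in place of `demo`; the merge simply broadcasts each card's profile to all of its transactions.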
Having established that, we now move to the merchant feature, which represents the point of transaction.
🛒 merchant¶
Data Integrity¶
The merchant feature represents the vendor or business where each transaction occurred. Earlier training set checks confirmed that this feature has 693 unique values and no missing entries, ensuring completeness.
However, completeness alone is insufficient - we must verify that the data is authentic, semantically consistent, and free from artificial duplication or irregularities.
Upon closer inspection, every merchant name follows a structured and realistic pattern, such as:
df_train['merchant'].sample(10)
| | merchant |
|---|---|
| 469158 | fraud_Cummings LLC |
| 886353 | fraud_Bernhard Inc |
| 715794 | fraud_Murray-Smitham |
| 1045963 | fraud_Haag-Blanda |
| 1029663 | fraud_Gutmann, McLaughlin and Wiza |
| 571553 | fraud_Kling Inc |
| 581164 | fraud_Berge-Hills |
| 1035728 | fraud_Boyer PLC |
| 749136 | fraud_Sporer-Keebler |
| 530846 | fraud_Witting, Beer and Ernser |
All merchant names begin with the "fraud_" prefix, followed by a synthetic business name composed of one or two surnames and an optional corporate suffix (Inc, LLC, Ltd, PLC, Group, and Sons, etc.).
To confirm the origin of this structure, we referenced the dataset's official description on Kaggle, which explicitly explains the generation process:
"The simulator has certain pre-defined lists of merchants, customers, and transaction categories. Using the Python library 'faker', and the number of customers and merchants you specify, an intermediate list is created. Transactions are then simulated according to behavioral profiles (e.g., 'adult females 25-50 rural') with defined transaction frequencies and amount distributions."
-Kartik2112, Kaggle Dataset: Credit Card Fraud Detection (Sparkov Simulator)
This confirms that merchant names were generated using the faker library, ensuring structural realism while remaining fully synthetic. Each name behaves like a legitimate business identifier, even though it was programmatically generated.
A quick check confirms that 100% of merchant entries begin with the "fraud_" prefix:
prefix_check = df_train['merchant'].str.startswith('fraud_').mean()
print(f"Percentage of merchants starting with 'fraud_': {prefix_check:.2%}")
Percentage of merchants starting with 'fraud_': 100.00%
Thus, the merchant feature exhibits uniform structure, synthetic consistency, and no semantic leakage of fraud-related meaning from its textual content.
To verify that merchant names do not implicitly encode fraud-related information, we examine whether naming patterns (e.g., suffixes) correlate with fraud likelihood:
global_fraud_rate = df_train['is_fraud'].mean()
print(f"Global fraud rate: {global_fraud_rate:.2%}")
df_train['suffix'] = df_train['merchant'].str.extract(r'(LLC|Group|Inc|and Sons|Ltd|PLC)', expand=False)
fraud_by_suffix = df_train.groupby('suffix')['is_fraud'].mean().sort_values(ascending=False)
print(fraud_by_suffix)
Global fraud rate: 0.58%
suffix
Inc         0.007364
PLC         0.006649
and Sons    0.005264
Ltd         0.005236
LLC         0.005103
Group       0.004678
Name: is_fraud, dtype: float64
The global fraud rate in the dataset is 0.58%, and the fraud ratios across all major merchant suffixes remain tightly clustered around this baseline - from 0.46% to 0.73%.
This minimal deviation (≈ ± 0.0015 in absolute terms) indicates that these fluctuations are statistically negligible and fall well within the range of normal sampling variation.
Therefore, no suffix category exhibits a disproportionately high fraud rate, confirming that merchant names do not encode or correlate with fraudulent behavior.
Merchant Distributions and Outliers¶
Next, we examine merchant-level transaction and fraud distributions to identify potential outliers or irregular concentration of fraud:
txn_per_merchant = df_train['merchant'].value_counts()
fraud_per_merchant = df_train.groupby('merchant')['is_fraud'].sum().sort_values(ascending=False)
display(txn_per_merchant.head(10)) # Top 10 merchants by transaction volume
display(fraud_per_merchant.head(10)) # Top 10 merchants by total fraud
| merchant | count |
|---|---|
| fraud_Kilback LLC | 4403 |
| fraud_Cormier LLC | 3649 |
| fraud_Schumm PLC | 3634 |
| fraud_Kuhn LLC | 3510 |
| fraud_Boyer PLC | 3493 |
| fraud_Dickinson Ltd | 3434 |
| fraud_Cummerata-Jones | 2736 |
| fraud_Kutch LLC | 2734 |
| fraud_Olson, Becker and Koch | 2723 |
| fraud_Stroman, Hudson and Erdman | 2721 |
| merchant | is_fraud |
|---|---|
| fraud_Rau and Sons | 49 |
| fraud_Cormier LLC | 48 |
| fraud_Kozey-Boehm | 48 |
| fraud_Kilback LLC | 47 |
| fraud_Doyle Ltd | 47 |
| fraud_Vandervort-Funk | 47 |
| fraud_Kuhn LLC | 44 |
| fraud_Padberg-Welch | 44 |
| fraud_Terry-Huel | 43 |
| fraud_Jast Ltd | 42 |
The top merchants by volume process between 3,000 - 4,400 transactions each, consistent with high-traffic businesses. Similarly, the top merchants by fraud counts correspond to these same high-volume entities, indicating that fraud frequency scales with activity, not with merchant identity.
top_merchants = pd.DataFrame({
'Transactions': txn_per_merchant.head(10),
'Fraud_Counts': fraud_per_merchant.head(10)
}).fillna(0)
top_merchants.plot(kind='bar', figsize=(10,5))
plt.title('Top 10 Merchants by Transactions vs. Fraud Counts')
plt.ylabel('Count')
plt.xlabel('Merchant')
plt.xticks(rotation=45, ha='right')
plt.show()
Graph 8 - Top Merchants by Transactions vs. Fraud Counts
The chart illustrates total transaction volume (blue) and total fraud count (orange) for the 10 most active merchants.
While the most active merchants naturally exhibit more fraud events, the ratio of fraud-to-total transactions remains stable across all entities. This demonstrates that fraud is proportionally distributed across the network rather than concentrated in specific merchants.
To ensure statistical validity, we'll apply an IQR-based outlier check on merchant transaction volumes:
Q1 = txn_per_merchant.quantile(0.25)
Q3 = txn_per_merchant.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outlier_merchants = txn_per_merchant[
(txn_per_merchant < lower_bound) | (txn_per_merchant > upper_bound)
]
print(f"Number of outlier merchants: {len(outlier_merchants)}")
Number of outlier merchants: 5
plt.figure(figsize=(10,5))
sns.histplot(txn_per_merchant, bins=50, kde=True)
plt.axvline(upper_bound, color='red', linestyle='--', label='Upper Outlier Threshold')
plt.axvline(lower_bound, color='red', linestyle='--')
plt.title("Distribution of Transactions per Merchant - Identifying Outliers")
plt.xlabel("Number of Transactions per Merchant")
plt.ylabel("Count of Merchants")
plt.legend()
plt.show()
Graph 9 - Distribution of Transactions per Merchant (Outlier Detection)
The histogram above illustrates the distribution of transaction volumes per merchant, with the red dashed lines marking the lower and upper thresholds based on the 1.5 x IQR rule.
The distribution is right-skewed and multimodal, suggesting several distinct merchant activity tiers - likely representing different merchant types such as small, medium, and high-volume vendors.
The analysis identified 5 outlier merchants exceeding the upper bound of normal activity (~3,500 transactions). These merchants exhibit exceptionally high transaction volumes compared to the rest of the population.
However, upon cross-checking with their respective fraud rates, these outliers do not display abnormal or inflated fraud ratios. Their elevated transaction counts are therefore attributed to legitimate high-volume business behavior, not data corruption or synthetic bias.
This pattern mirrors realistic market dynamics, where a small number of large retailers process a disproportionately high share of transactions - a natural "power-law" effect observed in real-world financial ecosystems.
Overall, the merchant feature is clean, structurally valid, and behaviorally consistent. Its values are uniformly generated, semantically neutral, and show realistic diversity in transaction frequency. The absence of abnormal fraud concentrations or naming irregularities confirms that merchants behave as reliable categorical identifiers.
From a modeling standpoint, this feature may be best leveraged through aggregated or statistical representations - such as per-merchant fraud rate, mean transaction amount, or temporal activity frequency - rather than as a raw categorical label. This approach is motivated by the fact that there are hundreds of distinct merchants, making one-hot encoding inefficient and prone to sparsity. Moreover, fraud signals appear to stem from behavioral dynamics (such as spending frequency or transaction timing) rather than merchant identity itself.
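One hedged sketch of such an aggregated representation is a smoothed per-merchant fraud rate ("target encoding"), which shrinks low-volume merchants toward the global rate; the smoothing strength k is an assumed hyperparameter and the four-row frame is illustrative:

```python
import pandas as pd

# Illustrative stand-in for df_train
demo = pd.DataFrame({
    'merchant': ['a', 'a', 'a', 'b'],
    'is_fraud': [1, 0, 0, 0],
})

global_rate = demo['is_fraud'].mean()  # overall fraud rate (0.25 here)
stats = demo.groupby('merchant')['is_fraud'].agg(['sum', 'count'])

# Shrink merchants with few transactions toward the global rate,
# limiting overfitting to noisy per-merchant estimates
k = 10  # assumed smoothing strength
stats['encoded'] = (stats['sum'] + k * global_rate) / (stats['count'] + k)
print(stats['encoded'].to_dict())
```

With k large relative to a merchant's transaction count the encoding stays near the global rate; with k small it approaches the raw per-merchant rate, so k directly trades variance against bias.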
With this validation complete, we now move on to explore the next feature - category, which describes the type of product or service purchased and may reveal further behavioral distinctions between legitimate and fraudulent transactions.
📚 category¶
The category feature specifies the type of merchant or industry associated with each transaction - e.g. "gas_transport", "grocery_pos", "shopping_net", "home", etc. It represents where and how money is spent, which makes it a behavioral and risk-sensitive dimension in fraud analysis.
Based on previous analysis, there are 14 unique categories, and no missing or invalid entries. This compact yet complete categorical structure suggests excellent data consistency and semantic validity - each value corresponds to a well-defined merchant type rather than arbitrary labels.
Category Distribution¶
To evaluate category balance and detect potential dominance or underrepresentation, we review transaction counts per category:
txn_per_cat = df_train['category'].value_counts()
txn_per_cat
| category | count |
|---|---|
| gas_transport | 131659 |
| grocery_pos | 123638 |
| home | 123115 |
| shopping_pos | 116672 |
| kids_pets | 113035 |
| shopping_net | 97543 |
| entertainment | 94014 |
| food_dining | 91461 |
| personal_care | 90758 |
| health_fitness | 85879 |
| misc_pos | 79655 |
| misc_net | 63287 |
| grocery_net | 45452 |
| travel | 40507 |
plt.figure(figsize=(10,6))
sns.barplot(
    y=txn_per_cat.index,
    x=txn_per_cat.values,
    hue=txn_per_cat.index,
    palette="Blues_r",
    legend=False
)
plt.title("Distribution of Transactions by Category", fontsize=14)
plt.xlabel("Number of Transactions", fontsize=12)
plt.ylabel("Category", fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.4)
plt.tight_layout()
plt.show()
Graph 10 - Distribution of Transactions by Category
The chart above illustrates the relative transaction volume across all 14 merchant categories.
- The dataset is dominated by gas_transport, grocery_pos and home transactions - each exceeding 120k records. These represent high-frequency, everyday purchases, characteristic of regular consumer spending.
- Mid-tier categories such as shopping_pos, kids_pets, and shopping_net maintain strong representation (≈90k-110k), reflecting diverse commercial activity across both physical and online channels.
- Lower-volume segments, including grocery_net and travel, still contain tens of thousands of transactions, ensuring that no category suffers from data sparsity.
This balanced distribution indicates that the category feature is well-structured and statistically robust, with each class large enough to support meaningful fraud-rate comparisons.
Fraud Rate Analysis by Category¶
Next, we examine the total number of fraudulent transactions and fraud ratios per category:
fraud_per_cat = df_train.groupby('category')['is_fraud'].sum().sort_values(ascending=False)
fraud_rate_per_cat = df_train.groupby('category')['is_fraud'].mean().sort_values(ascending=False)
fig, ax1 = plt.subplots(figsize=(10,6))
# Blue bars for fraud counts
sns.barplot(
x=fraud_per_cat.index,
y=fraud_per_cat.values,
color='steelblue',
ax=ax1
)
ax1.set_ylabel("Fraudulent Transactions (Count)", color="steelblue")
ax1.tick_params(axis='x', rotation=45)
# Red line for fraud rate
ax2 = ax1.twinx()
sns.lineplot(
x=fraud_rate_per_cat.index,
y=fraud_rate_per_cat.values,
color="red",
marker="o",
ax=ax2
)
ax2.set_ylabel("Fraud Rate", color="red")
plt.title("Fraud Distribution Across Merchant Categories", fontsize=14)
plt.tight_layout()
plt.show()
Graph 11 - Fraud Distribution Across Merchant Categories
The chart above compares the number of fraudulent transactions (blue bars) with the fraud rate (red line) across all merchant categories.
- The categories grocery_pos, shopping_net and misc_net dominate both in total fraud counts and relative fraud rates, marking them as the three most fraud-prone sectors in the dataset.
- Notably, shopping_net and misc_net represent online or card-not-present channels, which are inherently more vulnerable to fraudulent activity due to weaker identity verification mechanisms.
- The grocery_pos category - typically a physical point-of-sale (POS) channel - shows similarly high fraud involvement, suggesting either card cloning or local misuse, both common in real-world retail fraud.
- In contrast, categories such as travel and health_fitness exhibit both low fraud counts and very low fraud rates, implying that fraudsters rarely target these sectors within the simulated environment.
Overall, this dual-axis analysis highlights that fraud activity is not randomly distributed but rather clustered within specific commercial domains, primarily online retail and everyday POS categories. This pattern closely mirrors real-world fraud dynamics, where high-frequency, low-verification environments tend to attract the most fraudulent behavior.
From the analyses above, the category feature demonstrates excellent data integrity, no structural anomalies, and strong behavioral signal value. The fraud distribution across categories is both statistically meaningful and domain-consistent, showing that fraudulent activity clusters around specific merchant types rather than occurring uniformly.
In particular, online and high-frequency sectors (shopping_net, misc_net, grocery_pos) emerge as consistently higher-risk environments, reflecting real-world vulnerabilities in card-not-present and everyday retail transactions. This insight provides direct value for model development - categorical embeddings or one-hot representations of category can help the model learn contextual risk patterns, meaning, to recognize that a $500 purchase in shopping_net might carry greater fraud likelihood than the same amount spent in travel or health_fitness.
Therefore, category is not only a clean and reliable feature, but also an informative behavioral predictor - one that signals where fraudulent behaviors are most likely to occur and should thus be explicitly incorporated into the model's feature design.
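The one-hot idea mentioned above can be sketched on a toy frame. Note that the rows, amounts, and the cat prefix below are illustrative stand-ins, not values from the actual dataset:

```python
import pandas as pd

# Toy stand-in for df_train with hypothetical rows (values are illustrative only)
toy = pd.DataFrame({
    "category": ["shopping_net", "grocery_pos", "travel", "shopping_net"],
    "amt": [500.0, 42.0, 500.0, 310.0],
})

# One-hot representation of `category`: each level becomes a binary column,
# so a model can weight a $500 shopping_net purchase differently from a
# $500 travel purchase.
encoded = pd.get_dummies(toy, columns=["category"], prefix="cat")
print(sorted(c for c in encoded.columns if c.startswith("cat_")))
# → ['cat_grocery_pos', 'cat_shopping_net', 'cat_travel']
```

With only 14 merchant categories, one-hot columns stay compact; learned embeddings would mainly pay off at much higher cardinality.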
💵 amt¶
The amt feature represents the monetary value of each transaction. As one of the most behaviorally revealing and predictive features in fraud detection, transaction amount provides critical insight into risk magnitude and spending intent.
From a behavioral perspective:
Legitimate customers tend to operate within consistent spending ranges that reflect their income and lifestyle
Fraudsters, on the other hand, face an optimization tradeoff: maximize profit while minimizing detection risk. This often results in two distinct fraudulent behaviors:
High-value thefts, where large transactions are attempted for maximum gain
Micro-transactions ("testing" behavior), where small amounts are used to probe card validity before escalating the fraud
Therefore, we expect the upper end of the transaction spectrum to show elevated fraud risk due to high-value exploitation attempts, while low-value transactions become suspicious only when they occur repeatedly or in clusters - for example, when a single card executes multiple small payments within a short time window, to the same merchant. This distinction reflects real-world anti-fraud and anti-structuring practices used in financial systems worldwide.
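The "testing" pattern described above can be operationalized as a velocity feature: count small payments per card inside a trailing time window. A minimal sketch on a hypothetical mini-log - the cc_num identifier, the $5 cutoff, and the 30-minute window are all illustrative assumptions, not parameters taken from the dataset:

```python
import pandas as pd

# Hypothetical mini-log: card 1 fires several small "test" payments within
# minutes, while card 2 spends normally (all values are made up).
txns = pd.DataFrame({
    "cc_num": [1, 1, 1, 1, 2, 2],
    "ts": pd.to_datetime([
        "2020-06-01 10:00", "2020-06-01 10:03", "2020-06-01 10:07",
        "2020-06-01 10:09", "2020-06-01 09:00", "2020-06-01 18:00",
    ]),
    "amt": [1.5, 2.0, 1.0, 2.5, 45.0, 60.0],
})

SMALL = 5.0  # micro-transaction cutoff in dollars (an assumption)

small = txns[txns["amt"] < SMALL].sort_values("ts")
# For each card, count small payments inside the trailing 30-minute window
burst = (
    small.set_index("ts")
         .groupby("cc_num")["amt"]
         .rolling("30min")
         .count()
)
print(burst.max())  # card 1 reaches 4 small payments within 30 minutes
```

A high burst count flags exactly the "repeated small payments in a short window" behavior discussed above, even though each payment looks innocuous on its own.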
Data Integrity¶
Before interpreting behavioral patterns, it's essential to confirm that the amt feature represents valid and realistic monetary values
We begin by examining descriptive statistics and checking for impossible or inconsistent cases:
pd.set_option('display.float_format', '{:,.2f}'.format)
df_train['amt'].describe()
| amt | |
|---|---|
| count | 1,296,675.00 |
| mean | 70.35 |
| std | 160.32 |
| min | 1.00 |
| 25% | 9.65 |
| 50% | 47.52 |
| 75% | 83.14 |
| max | 28,948.90 |
The training set contains 1.29 million transactions, all with positive monetary values, confirming that there are no invalid (negative or zero) entries.
The minimum amount of $1 and a maximum of about $28,949 fall within realistic bounds for everyday and high-value spending - no anomalies or simulation errors were detected
The mean of $70 and median of $47.5 indicate a right-skewed distribution, where most purchases are small or moderate, while a few high-value transactions stretch the upper tail.
The standard deviation of ≈$160 reinforces this skewness, showing wide variability consistent with real-world consumer spending behavior
These results confirm that amt is a clean, logically consistent and trustworthy feature. The values correspond well to genuine transaction magnitudes rather than simulation noise or data corruption.
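The mean-above-median signature of a right-skewed distribution can be illustrated with a quick synthetic draw; the lognormal parameters below are chosen for illustration, not fitted to amt:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed "amounts" drawn from a lognormal distribution
amts = rng.lognormal(mean=3.8, sigma=1.0, size=100_000)

mean, median = amts.mean(), np.median(amts)
print(f"mean={mean:.1f}, median={median:.1f}")
# The long upper tail drags the mean above the median, the same
# mean > median relationship observed for `amt` above.
```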
Distribution of Fraudulent vs. Legitimate Amounts¶
plt.figure(figsize=(10,6))
sns.kdeplot(df_train[df_train['is_fraud'] == 0]['amt'], label='Legitimate', fill=True)
sns.kdeplot(df_train[df_train['is_fraud'] == 1]['amt'], label='Fraudulent', fill=True, color='red')
plt.xscale('log')
plt.title("Distribution of Transaction Amounts")
plt.xlabel("Transaction Amount ($, log scale)")
plt.ylabel("Density")
plt.legend()
plt.show()
Graph 12 - Distribution of Transaction Amounts
The KDE plot above compares the density of transaction amounts between legitimate and fraudulent cases on a logarithmic scale.
Legitimate transactions cluster heavily below ≈$200, with density peaking between $10 - $100, consistent with everyday consumer spending
Fraudulent transactions show two prominent peaks within the $300 - $1000 range, indicating a strong preference for mid- to high-value operations. Very few frauds occur at extremely low amounts, implying that large-value exploitation is the dominant strategy in this dataset.
Overall, this visualization confirms that frauds are not uniformly distributed across the monetary spectrum, they occur disproportionately at higher transaction values, which provides a strong predictive signal for machine learning models
Category-Amount interaction (Which sectors have expensive frauds?)¶
By combining amt and category together, we can detect where the largest fraudulent transactions occur:
fraud_amt_by_cat = (
df_train[df_train['is_fraud'] == 1]
.groupby('category')['amt']
.mean()
.sort_values(ascending=False)
)
plt.figure(figsize=(10,6))
fraud_amt_by_cat.plot(kind='bar', color='crimson')
plt.title("Average Fraudulent Transaction Amount by Category")
plt.ylabel("Average Amount ($)")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.show()
Graph 13 - Average Fraudulent Transaction Amount by Category
The chart displays the mean dollar amount of fraudulent transactions per merchant category.
Fraudulent purchases in shopping_net, shopping_pos and misc_net average $800 - $1000, indicating that fraudsters target high-value retail sectors where goods can be easily monetized.
Mid-range categories like entertainment and grocery_pos show moderate fraudulent amounts ($250 - $500), suggesting attempts to blend large purchases within normal consumer behavior
In contrast, essential service categories (e.g., gas_transport, health_fitness, personal_care) exhibit low-value frauds, consistent with their lower resale potential.
This pattern aligns with economic rationality in fraud behavior - targeting sectors with the highest financial gain and lowest detection barriers.
Temporal Analysis: Amount over Time¶
The goal here is to understand whether transaction amounts and particularly high-value fraudulent amounts, show temporal patterns.
By examining the evolution of transaction values over months, weekdays, and hours, we can determine when high-risk behaviors are most likely to occur
Let's start by visualizing the average daily transaction amount (both overall and for frauds only):
avg_amt_daily = df_train.groupby('date')['amt'].mean()
avg_amt_daily_fraud = df_train[df_train['is_fraud'] == 1].groupby('date')['amt'].mean()
# plot
plt.figure(figsize=(12,6))
plt.plot(avg_amt_daily, label="Average Amount (All)", color='steelblue', linewidth=1.3)
plt.plot(avg_amt_daily_fraud, label="Average Amount (Fraud)", color='crimson', linewidth=1.3)
plt.title("Average Transaction Amount Over Time")
plt.xlabel("Date")
plt.ylabel("Average Amount ($)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Graph 14 - Average Transaction Amount Over Time
The graph clearly distinguishes between legitimate and fraudulent spending behaviors:
Legitimate transactions show stable and consistent spending habits, averaging around $60 - $80 per transaction with minimal daily variation
Fraudulent transactions, however, are highly erratic - their average value oscillates dramatically between $200 and $1000, spiking at irregular intervals.
This pattern suggests that fraudulent activity occurs in intermittent high-value bursts, likely corresponding to coordinated attack periods or isolated high-gain attempts. The persistent vertical gap between the two lines further confirms that fraudulent transactions consistently involve much larger sums, even though they represent a minor fraction of total volume.
In essence, while normal spending is predictable and stable, fraud behavior is sporadic, opportunistic, and high-impact, which is a defining characteristic of real-world financial crime
The analysis of amt demonstrates that transaction value is both clean and behaviorally rich. It consistently differentiates legitimate and fraudulent patterns, with fraud showing sporadic, high-value bursts and a clear preference for certain high-gain sectors and off-hour timings. These findings confirm that amt is a core predictive driver in fraud detection, capturing both economic magnitude and behavioral intent.
Having established the financial characteristics of fraud, we now turn to demographic indicators - starting with the gender feature, to explore whether transaction behaviors and fraud likelihood vary across customer profiles
♀♂ gender¶
The gender feature introduces an interesting behavioral and ethical dimension. On the one hand, it could reveal differences in spending patterns, risk exposure, or fraud targeting strategies between males and females, potentially useful for model interpretability.
On the other hand, incorporating gender directly into predictive models raises ethical and fairness concerns: bias amplification could cause certain groups to be unfairly flagged as high-risk.
Thus, the goal here is exploratory understanding, not predictive discrimination:
Data Integrity¶
df_train['gender'].value_counts(dropna=False)
| count | |
|---|---|
| gender | |
| F | 709863 |
| M | 586812 |
The training set contains only two valid gender categories:
M (male) and F (female)
There are no missing or invalid entries, confirming full data completeness
In addition, we can see that the training set is not perfectly gender-balanced. There is a noticeably larger number of female records compared to male ones.
This imbalance is important to acknowledge: when comparing raw fraud counts, one gender might appear to have more fraud cases simply because it has more total transactions. Therefore, we will normalize by population to compare fraud rates, not absolute counts.
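A toy example (hypothetical numbers, not from the dataset) shows why this normalization matters - raw counts and per-transaction rates can rank the two groups in opposite orders:

```python
import pandas as pd

# Toy imbalance mirroring the situation above: F has more transactions and
# more raw fraud cases, yet M has the higher fraud *rate*.
toy = pd.DataFrame({
    "gender":   ["F"] * 1000 + ["M"] * 500,
    "is_fraud": [1] * 6 + [0] * 994 + [1] * 5 + [0] * 495,
})

fraud_counts = toy.groupby("gender")["is_fraud"].sum()
fraud_rates = toy.groupby("gender")["is_fraud"].mean() * 100

print(fraud_counts.to_dict())          # {'F': 6, 'M': 5}  -> F "wins" on counts
print(fraud_rates.round(1).to_dict())  # {'F': 0.6, 'M': 1.0} -> M wins on rate
```

This is exactly why the next section compares fraud rates via groupby(...).mean() rather than raw fraud counts.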
Overall, the gender feature is structurally clean, consistent, and ready for analysis
Fraud Rate by Gender¶
fraud_by_gender = df_train.groupby('gender')['is_fraud'].mean() * 100
plt.figure(figsize=(6,4))
# Assign `hue` with `legend=False` so `palette` applies without the seaborn >=0.13 deprecation warning
sns.barplot(x=fraud_by_gender.index, y=fraud_by_gender.values, hue=fraud_by_gender.index, palette='Reds', legend=False)
plt.title("Fraud Rate by Gender")
plt.xlabel("Gender")
plt.ylabel("Fraud Rate (%)")
plt.show()
Graph 15 - Fraud Rate by Gender
After normalization, we find that although there are more female records overall, the fraud rate is higher among males. This means that proportionally, men are involved in fraudulent activity more often per transaction than women in this training set.
This could reflect several factors, for example:
behavioral tendencies (e.g., higher-risk spending or greater exposure to certain categories)
demographic differences in transaction volume
or simply simulation parameters in the data generator
While the difference is statistically noticeable, it should be interpreted cautiously: correlation does not imply causation, and using gender directly as a predictive feature could bias the model unfairly.
To conclude, this analysis shows that gender correlates modestly with fraud occurrence: men exhibit a slightly higher fraud rate, even though women dominate transaction volume. However, given the potential for ethical bias, gender should be treated as an interpretable variable, not a decisive predictor.
Next, we turn to a more geographical perspective, analyzing the impact of location-related features on fraud probability
🌎 Geographical Impact on Fraud¶
The geographical dimension of financial behavior often plays a critical role in understanding and predicting fraudulent activity. Fraud is not only a matter of who commits it and how, but also where it occurs. Geographic features can reveal:
Hotspots of fraudulent activity
Behavioral irregularities, such as purchases far from the cardholder's home region
Socioeconomic influences, since wealthier or denser urban areas often exhibit distinct spending and fraud patterns
Transaction network dynamics, reflected in the physical or digital distance between customers and merchants
To investigate these spatial aspects, we will combine and analyze several location-related features in our dataset:
| Feature | Description |
|---|---|
| city, state, zip, street | Administrative and regional identifiers for cardholder location |
| lat, long | Geographic coordinates of the cardholder |
| city_pop | Estimated population of the cardholder's city |
| merch_lat, merch_long | Geographic coordinates of the merchant involved in the transaction |
Together, these variables allow us to explore fraud behavior across multiple spatial layers - from broad national trends to fine-grained local patterns
Data Integrity and Validation of Geographic Features¶
Before analyzing spatial fraud behavior, we will ensure that the geographical data we are working with is accurate, consistent, and realistic. Fraud analysis depends heavily on location-based reasoning: if our coordinates or city-state information are unreliable, any derived patterns lose meaning.
To address this, we begin by validating all geographic features present in our dataset:
| Feature | Description | Validation Objective |
|---|---|---|
| city, state, street, zip | Administrative identifiers | Verify that all listed locations correspond to real or valid U.S. places |
| lat, long | Cardholder's coordinates | Confirm they fall within the valid U.S. latitude-longitude range |
| merch_lat, merch_long | Merchant's coordinates | Verify geographic plausibility - merchants should also be within U.S. boundaries |
| city_pop | City population estimate | Check for realistic, non-negative population sizes |
To conduct this validation, we use the U.S. Cities Database publicly available on GitHub by Kelvins.
This comprehensive dataset includes over 19,000 verified U.S. cities and provides:
City and state names
Geographic coordinates (latitude and longitude)
By cross-referencing our dataset with this authoritative U.S. source, we can:
Verify that our cities and states exist and are properly matched
Confirm that our coordinates (lat, long, merch_lat, merch_long) fall within realistic U.S. boundaries
Detect any synthetic anomalies or out-of-range coordinates, which might indicate data generation artifacts
url = "https://raw.githubusercontent.com/kelvins/US-Cities-Database/main/csv/us_cities.csv"
us_dataset = pd.read_csv(url)
# Preview
print(us_dataset.head())
print(us_dataset.columns)
ID STATE_CODE STATE_NAME CITY COUNTY LATITUDE LONGITUDE
0 1 AK Alaska Adak Aleutians West 56.00 -161.21
1 2 AK Alaska Akiachak Bethel 60.89 -161.39
2 3 AK Alaska Akiak Bethel 60.89 -161.20
3 4 AK Alaska Akutan Aleutians East 54.14 -165.79
4 5 AK Alaska Alakanuk Kusilvak 62.75 -164.60
Index(['ID', 'STATE_CODE', 'STATE_NAME', 'CITY', 'COUNTY', 'LATITUDE',
'LONGITUDE'],
dtype='object')
# Count unique values in each geographic feature
geo_features = ['city', 'street', 'state', 'zip', 'lat', 'long', 'merch_lat', 'merch_long', 'city_pop']
unique_counts = {col: df_train[col].nunique() for col in geo_features}
print("Unique values in each geographic feature:\n")
for feature, count in unique_counts.items():
print(f"{feature:<12}: {count:,}")
Unique values in each geographic feature:

city        : 894
street      : 983
state       : 51
zip         : 970
lat         : 968
long        : 969
merch_lat   : 1,247,805
merch_long  : 1,275,745
city_pop    : 879
The table below summarizes the number of unique values across all geographic features, helping us to assess diversity, realism and internal consistency in the dataset:
| Feature | Unique Values | Interpretation |
|---|---|---|
| city | 894 | Matches a plausible number of medium-to-large U.S. cities represented in the simulation. Indicates broad geographic diversity without redundancy. |
| street | 983 | A realistic variety of simulated street names - confirming address-level granularity. This aligns with expectations for synthetic but human-like data generated through name-based simulation (e.g., "Maple St", "Main Ave"). |
| state | 51 | Perfectly consistent - includes all 50 U.S. states plus Washington D.C. |
| zip | 970 | Reasonable for a dataset of this scale; ZIP codes are highly granular, and ~1,000 unique values suggest broad spatial coverage without redundancy. |
| lat, long | 968 / 969 | Indicates that each cardholder location corresponds to a specific coordinate pair - a near one-to-one relationship. The small difference (969 vs. 968) likely reflects rounding or minimal coordinate overlap. |
| merch_lat, merch_long | 1,247,805 / 1,275,745 | Extremely high diversity, nearly matching the total number of transactions - implying that each merchant transaction has a unique coordinate pair. This is consistent with how synthetic merchant coordinates were generated in the dataset. |
| city_pop | 879 | Close to matching the number of cities (894), confirming internal consistency - most cities have distinct population values. Minor overlaps may occur for small towns with shared population estimates. |
So far, all of the geographic features demonstrate logical diversity and consistency. Now let's see how clean their values are by cross-referencing them with the U.S. database:
# Normalize state codes
valid_states = set(us_dataset['STATE_CODE'].unique())
state_match_ratio = df_train['state'].isin(valid_states).mean()
print(f"{state_match_ratio:.2%} of state codes in the dataset match valid U.S. states")
# Show any invalid or unknown state codes
invalid_states = df_train.loc[~df_train['state'].isin(valid_states), 'state'].unique()
print("Invalid or unrecognized states:", invalid_states)
100.00% of state codes in the dataset match valid U.S. states
Invalid or unrecognized states: []
As we can see, 100% of the entries in the state feature match the valid U.S. states, confirming consistent and clean data.
Next, we validate lat, long, merch_lat and merch_long. The continental U.S. lies approximately within the following ranges:
| Dimension | Minimum | Maximum |
|---|---|---|
| Latitude | 24.0° N | 49.0° N |
| Longitude | –125.0° W | –66.0° W |
We will use these boundaries to check whether all geographic coordinates fall within realistic U.S. limits:
# Define valid U.S. geographic boundaries
lat_min, lat_max = 24.0, 49.0
lon_min, lon_max = -125.0, -66.0
# Check cardholder coordinates
invalid_lat = df_train[~df_train['lat'].between(lat_min, lat_max)]
invalid_lon = df_train[~df_train['long'].between(lon_min, lon_max)]
# Check merchant coordinates
invalid_merch_lat = df_train[~df_train['merch_lat'].between(lat_min, lat_max)]
invalid_merch_lon = df_train[~df_train['merch_long'].between(lon_min, lon_max)]
# Results
print(f"Invalid cardholder latitudes: {len(invalid_lat)}")
print(f"Invalid cardholder longitudes: {len(invalid_lon)}")
print(f"Invalid merchant latitudes: {len(invalid_merch_lat)}")
print(f"Invalid merchant longitudes: {len(invalid_merch_lon)}")
Invalid cardholder latitudes: 4679
Invalid cardholder longitudes: 4679
Invalid merchant latitudes: 11062
Invalid merchant longitudes: 5227
The coordinate validation shows that while the vast majority of both cardholder and merchant locations fall within valid U.S. boundaries, a small minority of points (≈0.4% for cardholders and ≈0.8% for merchants) lies slightly outside the continental latitude-longitude range. This deviation is expected and acceptable, as it likely represents U.S. territories (such as Hawaii, Alaska, or Puerto Rico) or minor coordinate noise introduced during data simulation. These outliers therefore do not compromise the overall geographic integrity of the dataset; the coordinate features remain realistic, coherent, and suitable for analysis.
Let's analyze the city feature:
# Normalize both city columns for consistent comparison
df_train['city_norm'] = df_train['city'].str.title().str.strip()
us_dataset['city_norm'] = us_dataset['CITY'].str.title().str.strip()
# Create a set of valid cities for fast lookup
valid_cities = set(us_dataset['city_norm'])
# Check validity
match_ratio = df_train['city_norm'].isin(valid_cities).mean()
print(f"{match_ratio:.2%} of cities in the dataset match valid U.S. cities")
# Mismatches
invalid_cities = df_train.loc[~df_train['city_norm'].isin(valid_cities), 'city'].unique()[:20]
print("Sample of non-matching cities:", invalid_cities)
99.58% of cities in the dataset match valid U.S. cities
Sample of non-matching cities: ['New York City' 'Pembroke Township']
# Check for near matches / alternative naming
us_dataset[us_dataset['CITY'].str.contains("New York", case=False)]
us_dataset[us_dataset['CITY'].str.contains("Pembroke", case=False)]
| ID | STATE_CODE | STATE_NAME | CITY | COUNTY | LATITUDE | LONGITUDE | city_norm | |
|---|---|---|---|---|---|---|---|---|
| 4057 | 4058 | FL | Florida | Pembroke Pines | Broward | 26.02 | -80.30 | Pembroke Pines |
| 4638 | 4639 | GA | Georgia | Pembroke | Bryan | 32.16 | -81.55 | Pembroke |
| 9351 | 9352 | KY | Kentucky | Pembroke | Christian | 36.80 | -87.33 | Pembroke |
| 10379 | 10380 | MA | Massachusetts | North Pembroke | Plymouth | 42.09 | -70.79 | North Pembroke |
| 10406 | 10407 | MA | Massachusetts | Pembroke | Plymouth | 42.06 | -70.80 | Pembroke |
| 11329 | 11330 | ME | Maine | Pembroke | Washington | 44.97 | -67.20 | Pembroke |
| 15445 | 15446 | NC | North Carolina | Pembroke | Robeson | 34.69 | -79.18 | Pembroke |
| 18313 | 18314 | NY | New York | East Pembroke | Genesee | 43.00 | -78.31 | East Pembroke |
| 27148 | 27149 | VA | Virginia | Pembroke | Giles | 37.33 | -80.62 | Pembroke |
Only two cities failed to match exactly: New York City and Pembroke Township. A closer inspection shows that:
New York City corresponds to New York in the U.S. Cities Database - the same geographic entity differing only by the "City" suffix
Pembroke Township aligns with multiple valid Pembroke locations across states such as Florida, Georgia, and North Carolina - all legitimate U.S. municipalities.
These discrepancies stem purely from synthetic naming variations introduced by the simulator, not from invalid or missing data. Therefore:
The city feature is structurally complete, geographically accurate, and free from semantic inconsistencies
No cleaning or data correction is required
Let's validate the street values. Since they are simulated, we can't verify them against a real-world database. However, we can still check structural integrity, ensuring they look like real street addresses and are diverse:
# Checking diversity
print(df_train['street'].sample(10).tolist())
['72269 Elizabeth Field Apt. 132', '7529 Carter Well Suite 262', '41851 Victor Drives Suite 219', '220 Frank Gardens', '597 Jenny Ford Apt. 543', '37910 Ward Lights', '663 Anna Plaza', '144 Martinez Curve', '6970 Blake Trail', '950 Sheryl Spurs']
The street names look realistic, with no missing or obviously invalid entries. This is expected, so we can continue with the zip feature next. We will check that ZIP codes follow valid U.S. formatting - numeric and within the assigned range (00501-99950):
invalid_zips = df_train[(df_train['zip'] < 501) | (df_train['zip'] > 99950)]
print(f"Invalid ZIP codes: {len(invalid_zips)}")
Invalid ZIP codes: 0
As we can see, all ZIPs are valid and within range - confirming geographical plausibility.
Finally, we verify that population values are positive, realistic, and demographically plausible
print(df_train['city_pop'].describe())
count   1,296,675.00
mean       88,824.44
std       301,956.36
min            23.00
25%           743.00
50%         2,456.00
75%        20,328.00
max     2,906,700.00
Name: city_pop, dtype: float64
non_integer_pop = df_train[~(df_train['city_pop'] % 1 == 0)]
print(f"Number of non-integer population entries: {len(non_integer_pop)}")
Number of non-integer population entries: 0
The city_pop feature shows a strongly right-skewed distribution, which perfectly aligns with the real demographic structure of the United States:
Most records come from smaller towns or suburban areas - reflected in the low median (≈2500 residents)
The upper quartile (≈20,000) represents medium-sized cities
The extreme tail (max ≈2.9M) corresponds to major metropolitan areas like New York, Los Angeles, or Chicago.
The mean (≈88K) being much higher than the median confirms the long-tail nature of U.S. urban populations - many small towns and few very large cities.
There are no negative or unrealistic values, and all population magnitudes are demographically plausible (ranging from small rural communities to dense urban centers). This confirms that the city_pop feature is clean.
Geographical Fraud Analysis¶
Having verified the integrity and realism of our geographical data, we now move from validation to spatial exploration, using geographic and demographic attributes to uncover patterns that explain where and why fraudulent activity occurs.
We focus on three main research questions:
Distance-Fraud Relationship:
Do transactions that occur farther from the cardholder's location have a higher likelihood of being fraudulent?
Population-Fraud Correlation:
Is fraud more prevalent in densely populated cities, or do smaller towns experience disproportionately higher fraud rates?
State-Level Fraud Analysis:
Which states contribute the most to total fraud, and which exhibit the highest fraud rates relative to their transaction volumes?
Map Visualization:
# Prepare city-level fraud stats
city_stats = df_train.groupby('city').agg(
total_txn =('is_fraud', 'count'),
total_fraud = ('is_fraud', 'sum'),
avg_population=('city_pop', 'mean')
)
city_stats['fraud_rate'] = city_stats['total_fraud'] / city_stats['total_txn'] * 100  # as a percentage, matching the popup's "%" label and the color thresholds below
# Normalize city names in both datasets
city_stats = city_stats.reset_index()
city_stats['city_norm'] = city_stats['city'].str.title().str.strip()
us_dataset['city_norm'] = us_dataset['CITY'].str.title().str.strip()
# Merge with coordinates
city_map_data = pd.merge(
city_stats,
us_dataset[['city_norm', 'LATITUDE', 'LONGITUDE', 'STATE_CODE', ]],
on= 'city_norm',
how='inner'
)
# Initialize map centered on continental US
fraud_map = folium.Map(location=[37.5, -96.5], zoom_start=4, tiles='CartoDB positron')
# Create cluster for better performance
marker_cluster = MarkerCluster().add_to(fraud_map)
# Add markers
for _, row in city_map_data.iterrows():
if row['total_txn'] < 30: # skip small cities (low data reliability)
continue
popup_text = (f"<b>City:</b> {row['city_norm']}<br>"
f"<b>State:</b> {row['STATE_CODE']}<br>"
f"<b>Population:</b> {int(row['avg_population']):,}<br>"
f"<b>Fraud Rate:</b> {row['fraud_rate']:.2f}%<br>"
f"<b>Fraud Cases:</b> {int(row['total_fraud'])}<br>"
f"<b>Total Transactions:</b> {int(row['total_txn'])}")
# Color code - higher fraud rate = darker red
color = 'green' if row['fraud_rate'] < 0.5 else 'orange' if row['fraud_rate'] < 2 else 'red'
folium.CircleMarker(
location=[row['LATITUDE'], row['LONGITUDE']],
radius=max(3, min(row['fraud_rate'] / 2, 10)), # scale size with fraud rate
color=color,
fill=True,
fill_opacity=0.6,
popup=popup_text
).add_to(marker_cluster)
fraud_map
Graph 16 - Interactive Map of Fraud Rate
The map above visualizes the spatial distribution of fraud cases across the U.S. Each marker represents a city, color-coded by its fraud intensity and scaled by the relative severity of fraudulent activity
🟢 Green markers represent low-risk cities (fraud rate < 0.5%)
🟠 Orange markers indicate moderate-risk regions (0.5-2%)
🔴 Red markers highlight fraud hotspots (>2%)
Here are the key observations:
Widespread moderate activity:
The majority of U.S. cities display orange markers, indicating moderate fraud levels between 0.5% and 2%. This suggests that fraud is not isolated to specific regions, but rather distributed across the entire country, consistent with the idea that digital and card-based fraud is a nationwide phenomenon.
Concentration in the South and East:
Noticeably higher fraud rates are observed in cities of the southern and eastern United States, including dense urban corridors and economically active states. These regions host a higher concentration of large metropolitan areas, which naturally experience greater transaction volume, and thus higher exposure to fraud attempts.
Peripheral regions with low risk:
Outlying cities in areas such as Alaska, Hawaii and Puerto Rico (San Juan) show mainly green points, reflecting very low fraud rates. This pattern likely results from lower population and transaction volume in these regions.
Population does not directly predict fraud:
Despite including population data in the visualization, there is no clear linear correlation between city size and fraud rate. Some highly populated cities (e.g., large metro areas) exhibit moderate fraud rates, while certain smaller towns demonstrate disproportionately high rates. This highlights that fraud exposure is influenced not only by population but also by factors such as economic activity, transaction diversity, and local enforcement intensity.
Overall, the interactive map illustrates that fraudulent activity in the US is both geographically diverse and spatially correlated - areas with dense commerce and urban concentration tend to attract more fraud attempts, yet smaller, less populated regions are not immune.
Distance-Fraud Relationship:
def haversine(lat1, lon1, lat2, lon2):
# Convert degrees to radians
lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
dlat = lat2 - lat1
dlon = lon2 - lon1
a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
r = 6371 # Earth radius in km
return c * r
# Apply to dataframes
df_train['distance_cardholder_merchant'] = haversine(
df_train['lat'], df_train['long'], df_train['merch_lat'], df_train['merch_long']
)
df_test['distance_cardholder_merchant'] = haversine(
df_test['lat'], df_test['long'], df_test['merch_lat'], df_test['merch_long']
)
# Bin distances into ranges (for clarity)
bins = [0, 1, 5, 10, 50, 100, 500, 1000, 5000, np.inf]
labels = ["<1km", "1–5km", "5–10km", "10–50km", "50–100km", "100–500km", "500–1000km", "1000–5000km", ">5000km"]
df_train['distance_group'] = pd.cut(df_train['distance_cardholder_merchant'], bins=bins, labels=labels)
# Compute fraud rate per distance group
distance_fraud_stats = (
df_train.groupby('distance_group', observed=False)['is_fraud']
.agg(['count', 'sum'])
.rename(columns={'count': 'Total Txns', 'sum': 'Fraud Txns'})
)
distance_fraud_stats['Fraud Rate (%)'] = 100 * distance_fraud_stats['Fraud Txns'] / distance_fraud_stats['Total Txns']
display(distance_fraud_stats)
| Total Txns | Fraud Txns | Fraud Rate (%) | |
|---|---|---|---|
| distance_group | |||
| <1km | 106 | 1 | 0.94 |
| 1–5km | 2598 | 9 | 0.35 |
| 5–10km | 7862 | 43 | 0.55 |
| 10–50km | 254630 | 1430 | 0.56 |
| 50–100km | 728628 | 4276 | 0.59 |
| 100–500km | 302851 | 1747 | 0.58 |
| 500–1000km | 0 | 0 | NaN |
| 1000–5000km | 0 | 0 | NaN |
| >5000km | 0 | 0 | NaN |
We used the Haversine distance function to calculate the shortest distance (in kilometers) between two points on Earth's surface using their latitude and longitude coordinates.
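As a standalone sanity check of the formula, a well-known city pair should give the expected great-circle distance; the coordinates below are hardcoded for illustration:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km (Earth radius 6371 km), same formula as above
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# New York City -> Los Angeles is roughly 3,900-4,000 km great-circle
d = haversine(40.7128, -74.0060, 34.0522, -118.2437)
print(round(d), "km")
```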
Using the distances between the cardholder's coordinates and the merchant's location, transactions were grouped into distance intervals to analyze how physical proximity influences fraud probability.
Based on the findings:
The majority of transactions are local or regional: nearly all fall within 500 km of the cardholder, and there are no transactions beyond that range.
There is a slightly elevated fraud rate at very short distances: transactions occurring within 1 km of the cardholder's registered location show a fraud rate of 0.94%, slightly higher than the average across other bins. However, this bin contains only 106 transactions (a single fraud case), so the estimate is statistically fragile; it may also reflect online or card-not-present purchases recorded near the cardholder's address.
Based on these findings, distance alone may not be a key predictor in this dataset; however, it can still be useful for surfacing patterns when combined with other features.
Zip feature usefulness
The zip feature includes around 970 unique values, indicating realistic granularity across the dataset. This confirms that the ZIP code field is syntactically valid and exhibits sufficient diversity to represent geographically distributed transactions:
zip_city_corr = df_train.groupby(['city', 'zip']).size().reset_index(name='count')
print(f"Number of (city, zip) pairs: {len(zip_city_corr)} vs total cities: {df_train['city'].nunique()}")
Number of (city, zip) pairs: 970 vs total cities: 894
However, further exploration revealed that ZIP codes are highly correlated with city: with 970 unique (city, ZIP) pairs across 894 cities, the relationship is almost one-to-one. This means ZIP codes, while structurally valid, contribute little geographic insight beyond the city field. Therefore, the zip feature will be excluded from subsequent analysis to reduce redundancy.
df_train = df_train.drop(columns=['zip'])
df_test = df_test.drop(columns=['zip'])
print("ZIP column dropped from both train and test datasets")
ZIP column dropped from both train and test datasets
Our comprehensive geographic exploration demonstrates that the dataset's spatial features are accurate, realistic, and analytically reliable, offering a solid foundation for spatial fraud modeling.
All geographic attributes were validated against verified U.S. data, and the dataset showed near-perfect consistency, with only minimal outliers likely representing U.S. territories or synthetic noise
Fraudulent activity is broadly distributed across the United States rather than localized to specific regions. The interactive map highlights moderate fraud intensity (0.5 - 2%) in most areas, with denser fraud presence in the South and East, where economic activity and transaction volumes are higher
The average transaction distance (~76km) is virtually identical for legitimate and fraudulent transactions, implying that distance on its own is not a discriminative feature in this dataset, though it may still add value when combined with other features
Population size alone does not predict fraud - both small and large cities experience similar fraud rates. This suggests that fraud exposure depends more on economic behavior and transaction diversity than on city size.
A preliminary assessment shows that ZIP codes likely provide redundant geographic information, strongly correlated with city. They can be safely omitted from further modeling or visualization to simplify the analysis
In conclusion, the geographic features are clean, interpretable, and diverse, supporting the reliability of subsequent modeling tasks while confirming that fraudulent activity is geographically widespread rather than isolated
🔨 job¶
The job feature represents the cardholder's occupation. Occupational data can reveal socioeconomic and behavioral patterns that correlate with both transaction behavior and fraud vulnerability.
From a behavioral perspective:
Certain professions (for instance, executives, engineers, or salespeople) may show higher transaction volume due to lifestyle or travel
Jobs with frequent online spending or travel might face greater fraud exposure
Conversely, other professions might exhibit lower fraud risk, potentially due to fewer high-value purchases or less card usage
However, analyzing this feature requires caution: we must first make sure the data is clean, meaningful, and ethically interpreted, since job-based profiling could introduce bias if misused:
Data Integrity¶
print(f"Number of unique jobs in the training set: {df_train['job'].nunique()}")
print(f"Number of unique jobs in the test set: {df_test['job'].nunique()}")
print("\nSample job from the training set:")
print(df_train['job'].sample(10).tolist())
Number of unique jobs in the training set: 494
Number of unique jobs in the test set: 478

Sample job from the training set:
['Loss adjuster, chartered', 'Waste management officer', 'Education administrator', 'Chief Executive Officer', 'Sports development officer', 'Firefighter', 'Secondary school teacher', 'Heritage manager', 'Web designer', 'Systems developer']
The dataset contains ≈ 500 unique job titles, covering a wide range of occupations, from Neurosurgeon to Tax adviser. There are no missing entries, confirming that the feature is structurally complete. However, the large number of categories introduces sparsity: most job titles appear only a handful of times. This sparsity may limit direct interpretability and requires aggregation or encoding (e.g., target encoding, frequency grouping) for machine learning
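As a sketch of the frequency-grouping idea mentioned above (the threshold of 3 in the toy example and the helper name are illustrative choices, not the notebook's actual code):

```python
import pandas as pd

def group_rare_categories(s: pd.Series, min_count: int = 100, other_label: str = "Other") -> pd.Series:
    """Replace categories appearing fewer than `min_count` times with a single bucket."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other_label)

# Tiny example: 'Writer' appears only twice, so it collapses into 'Other'
jobs = pd.Series(["Lawyer"] * 5 + ["Writer"] * 2)
print(group_rare_categories(jobs, min_count=3).unique())  # ['Lawyer' 'Other']
```

This keeps frequent, well-estimated categories intact while preventing hundreds of near-singleton job titles from inflating the feature space.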
Job Frequency and Fraud Rate¶
Let's explore which occupations appear most frequently and whether certain jobs exhibit higher than average fraud rates
job_freq = df_train['job'].value_counts().head(15)
print(job_freq)
job
Film/video editor                      9779
Exhibition designer                    9199
Naval architect                        8684
Surveyor, land/geomatics               8680
Materials engineer                     8270
Designer, ceramics/pottery             8225
Systems developer                      7700
IT trainer                             7679
Financial adviser                      7659
Environmental consultant               7547
Chartered public finance accountant    7210
Scientist, audiological                7174
Chief Executive Officer                7172
Copywriter, advertising                7146
Comptroller                            6730
Name: count, dtype: int64
The most common occupations in the training set include technical, creative, and financial professions, such as Film/Video Editor, Exhibition Designer, Naval Architect, and Financial Adviser. This reflects a diverse yet synthetic occupational landscape, where roles were randomly assigned to ensure variety rather than mirroring real-world job frequency.
The dominance of certain creative and technical titles also suggests that the dataset is balanced by design rather than by socioeconomic distribution. Job frequencies do not represent actual labor-market proportions; they are primarily useful for behavioral segmentation and categorical encoding during modeling.
# Filter jobs with sufficient data
filtered_jobs = (
df_train.groupby('job')['is_fraud']
.agg(['count', 'mean'])
.rename(columns={'count': 'total_txn', 'mean': 'fraud_rate'})
.query('total_txn >= 100') # filter out rare occupations for sufficient representation
.sort_values(by='fraud_rate', ascending=False)
)
filtered_jobs['fraud_rate'] *= 100
print(f"Number of job categories after filtering: {filtered_jobs.shape[0]}")
filtered_jobs.head(15)
Number of job categories after filtering: 475
| job | total_txn | fraud_rate |
|---|---|---|
| Lawyer | 540 | 5.19 |
| TEFL teacher | 533 | 4.13 |
| Community development worker | 536 | 4.10 |
| Clinical cytogeneticist | 508 | 3.54 |
| Writer | 504 | 2.98 |
| Geneticist, molecular | 545 | 2.94 |
| Conservator, museum/gallery | 514 | 2.92 |
| Magazine journalist | 533 | 2.63 |
| Field trials officer | 518 | 2.51 |
| Civil Service administrator | 506 | 2.37 |
| Medical technical officer | 1066 | 2.35 |
| Charity officer | 519 | 2.31 |
| Pharmacist, hospital | 1059 | 2.27 |
| Minerals surveyor | 530 | 2.26 |
| Engineer, structural | 492 | 2.24 |
plt.figure(figsize=(10,6))
sns.barplot(
y=filtered_jobs.head(15).index,
x=filtered_jobs.head(15)['fraud_rate'],
hue=filtered_jobs.head(15).index,
palette='Reds_r',
legend=False
)
plt.title('Top 15 Jobs by Fraud Rate (%) - Filtered (≥100 transactions)')
plt.xlabel('Fraud Rate (%)')
plt.ylabel('Job Title')
plt.show()
Graph 17 - Top 15 jobs by Fraud Rate (Filtered)
From the observations above, we can conclude that:
Fraud rates are realistic and stable, ranging roughly between 2-5% among the top jobs. This validates our filtering step, as it removes random outliers caused by small-sample bias.
The top jobs span multiple sectors - law, education, healthcare, journalism, and science - suggesting no single occupational domain dominates in fraudulent activity. Instead, fraud appears evenly distributed across various professions, which is consistent with synthetic data where fraud is not occupationally biased.
Some professions, such as lawyers, consultants, or writers, may reflect higher transaction independence: individuals who manage their own payments, travel, or online activities, possibly increasing exposure to fraud-like transactions. Interestingly, several specialized scientific and medical professions appear among the higher-fraud categories. However, these anomalies are most likely artifacts of the synthetic data generation process, where occupations were randomly assigned and not causally linked to fraud risk, resulting in spurious correlations that do not reflect real-world behavior.
Each listed occupation has around 500-1000 transactions, confirming adequate sample size and reliability. Small fluctuations may still occur due to random variance, but the fraud-rate estimates rest on sufficient data.
Overall, the job feature demonstrates strong data completeness and semantic richness, offering potential insights into behavioral differences among cardholders. However, due to its high cardinality and synthetic assignment, raw job titles are not directly predictive of fraud risk. The feature remains valuable for modeling when transformed, for example, through frequency encoding, target encoding, or sector based grouping - which will help capture broad socioeconomic patterns without introducing overfitting.
Having validated and explored occupational patterns, we can now turn to another demographic variable - the dob (date of birth) feature - to investigate whether age-related behavioral trends influence fraud likelihood
👶 dob¶
The dob feature captures the date of birth of the cardholder, representing a fundamental demographic attribute. Age-related information can be an important factor in fraud detection, as spending habits, digital literacy and risk exposure can vary significantly across age groups.
From an analytical perspective:
Younger users might exhibit more online or mobile-driven spending, possibly increasing exposure to digital fraud
Middle-aged users often perform higher-value transactions, making them more attractive targets for fraudsters
Older users may show more stable spending patterns, which can make deviations more detectable.
Before drawing any conclusions, we first verify that the dob feature is well-structured and realistic:
Data Integrity¶
The first step in validating the dob feature is to confirm that all date values are correctly formatted and fall within plausible human age boundaries. To do so, we inspect the minimum and maximum birthdates and visualize their chronological distribution
df_train['dob'] = pd.to_datetime(df_train['dob'], errors='coerce')
df_test['dob'] = pd.to_datetime(df_test['dob'], errors='coerce')
print(f"Minimum DOB: {df_train['dob'].min()}")
print(f"Maximum DOB: {df_train['dob'].max()}")
Minimum DOB: 1924-10-30 00:00:00 Maximum DOB: 2005-01-29 00:00:00
Legal and logical context:
Credit card eligibility varies by country, but in the U.S. the minimum age to open a credit account in one's own name is typically 18; minors can only have a card as authorized users on a parent's account. So in a U.S.-based dataset like this one, individuals born after 2001 (under 18 in 2019) would be highly suspicious or implausible as cardholders: they should either be authorized users, not primary cardholders, or represent synthetic noise from the data generator.
Let's first extract all transactions made by cardholders under 18 years old, since this violates typical U.S. credit-card eligibility rules, and examine their count and fraud ratio:
# Compute (approximate) age: transaction year minus birth year (ignores birth month/day)
df_train['age'] = df_train['year'] - df_train['dob'].dt.year
df_test['age'] = df_test['year'] - df_test['dob'].dt.year  # 'age' will be used as a training feature, so we add it to the test set as well
print(f"Minimum age: {df_train['age'].min()}")
print(f"Maximum age: {df_train['age'].max()}")
print(df_train[['dob', 'date', 'age']].head())
Minimum age: 14
Maximum age: 96
dob date age
0 1988-03-09 2019-01-01 31
1 1978-06-21 2019-01-01 41
2 1962-01-19 2019-01-01 57
3 1967-01-12 2019-01-01 52
4 1986-03-28 2019-01-01 33
Now that we have the age column, we can drop the dob column from the training and test sets:
df_train = df_train.drop(columns=['dob'], errors='ignore')
df_test = df_test.drop(columns=['dob'], errors='ignore')
print("dob column dropped from both train and test datasets")
dob column dropped from both train and test datasets
illegal_age_mask = df_train['age'] < 18
illegal_age_txn = df_train[illegal_age_mask]
print(f"Number of transactions with cardholders under 18: {len(illegal_age_txn)}")
Number of transactions with cardholders under 18: 13430
# Fraud ratio
fraud_ratio_illegal = illegal_age_txn['is_fraud'].mean() * 100
print(f"Fraud rate among underage transactions: {fraud_ratio_illegal:.2f}%")
Fraud rate among underage transactions: 0.45%
We've found 13,430 transactions linked to cardholders under 18 years old. Given that our training set has around 1.29 million rows, that is roughly 1% of all transactions, confirming that underage entries exist but represent a small fraction of the dataset. Moreover, the fraud rate among these under-18 records is 0.45%, which suggests that the presence of underage cardholders does not encode a special behavioral signal related to fraud. Instead, it is more likely consistent with random simulation variance from the data generator.
Therefore, given that the dataset is synthetically generated, these entries are retained in the dataset, as they do not distort statistical distributions or introduce bias. Their inclusion helps preserve the dataset's overall structure and ensures that subsequent models are trained on the full synthetic variety of profiles
Let us now check the age distribution among cardholders:
plt.figure(figsize=(10,5))
sns.histplot(df_train['age'], bins=40, kde=True, color='mediumseagreen')
plt.title('Distribution of Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Graph 18 - Distribution of Ages
The age distribution is realistic, continuous, and clean, with no formatting errors or implausible outliers. Most cardholders are young to middle-aged adults (25-55 years old), which aligns with the profile of typical active credit card users. A small fraction of younger entries (under 18) and older entries (above 85) is also present; both represent natural statistical tails of the synthetic population rather than real anomalies. These groups are rare, which mirrors the real world, where the youngest and oldest generations are more likely to use cash than credit cards.
Overall, the dob feature is clean and reliable, and the derived age variable can be confidently used in the analysis and modeling stages.
Average transaction amount per age¶
age_stats = df_train.groupby('age').agg(
avg_amt=('amt', 'mean'),
fraud_rate=('is_fraud', 'mean'),
transaction_count=('is_fraud', 'count')
).reset_index()
age_stats['fraud_rate'] *= 100 # Convert fraud rate to percentage
plt.figure(figsize=(10,5))
sns.lineplot(data=age_stats, x='age', y='avg_amt', color='steelblue')
plt.title('Average Transaction Amount by Age')
plt.xlabel('Age')
plt.ylabel('Average Amount')
plt.grid(True, linestyle ='--', alpha=0.5)
plt.show()
Graph 19 - Average Transaction Amount by Age
The visualization above shows the average transaction amount for each cardholder age
Early adulthood (15-25 years old):
- Spending is lower and unstable, with noticeable fluctuations
- This group likely includes students or early-career individuals, making smaller, inconsistent purchases
- The few very young ages (below 18) are rare outliers and may explain the short dip near age 20
Prime working years (30-45 years old):
The highest average spending occurs in this range, peaking around the mid-30s to early 40s (~$80-83 per transaction)
This reflects typical increased financial capacity and more frequent high-value transactions, consistent with income growth and family-related expenses
Middle to older adults (50-70 years old):
The average amount gradually declines, stabilizing around $65-70
This suggests reduced purchase frequency or more controlled spending habits as individuals age
Seniors (70+ years old):
Spending remains moderate but erratic, likely due to smaller sample sizes and synthetic noise in later ages.
The absence of a clear upward or downward trend beyond 80 supports that the data is synthetic but statistically stable
plt.figure(figsize=(10,5))
sns.lineplot(data=age_stats, x='age', y='fraud_rate', color='firebrick')
plt.title('Fraud Rate by Age (%)')
plt.xlabel('Age')
plt.ylabel('Fraud Rate (%)')
plt.grid(True, linestyle ='--', alpha=0.5)
plt.show()
Graph 20 - Fraud Rate by Age (%)
Several distinct patterns emerge from the following graph:
Overall fraud rates remain low
- Across nearly all ages, the fraud rate fluctuates between 0.3% and 1%, with a few random spikes. This confirms that the dataset maintains a realistic fraud prevalence consistent with typical financial data
Younger users (under 25)
Fraud levels are noisier and occasionally spike (around age 18-20), likely due to small sample size and low transaction volume among minors and new credit users
These brief peaks are not meaningful behavioral signals
Prime working age (25 - 60)
Fraud rate is relatively stable, hovering around 0.5-0.8%
This suggests a balanced risk distribution: no single adult age range is disproportionately targeted
Consistent fraud levels here likely reflect well-distributed spending and exposure across this demographic
Older adults (70+)
There is a slight upward drift and more volatility starting around age 70, with peaks exceeding 1.5 - 1.8%
This may correspond to lower digital literacy, less frequent account monitoring or targeted fraud attempts - a pattern that, in real-world data, often reflects heightened vulnerability among elderly populations.
However, since this dataset is synthetically generated, these fluctuations might also arise from random noise rather than genuine behavioral effects
To verify this, we will statistically test whether the higher fraud rate observed in older age groups represents a consistent pattern or simply a random variance caused by limited sample size
Statistical test:
We will conduct a two-proportion Z-test to determine whether the increase in fraud prevalence among elderly cardholders is merely random variance or a genuine pattern in the dataset. We will compare the fraud rate of elderly users (≥70 years) against that of all younger users (<70 years):
# Define elderly threshold (70 years and above)
elderly_mask = df_train['age'] >= 70
# Fraud counts
fraud_elderly = df_train.loc[elderly_mask, 'is_fraud'].sum()
fraud_non_elderly = df_train.loc[~elderly_mask, 'is_fraud'].sum()
# Sample sizes
n_elderly = elderly_mask.sum()
n_non_elderly = (~elderly_mask).sum()
# Run two-proportion z-test
count = np.array([fraud_elderly, fraud_non_elderly])
nobs = np.array([n_elderly, n_non_elderly])
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.5f}")
# Calculate fraud rates for reference
fraud_rate_elderly = (fraud_elderly / n_elderly) * 100
fraud_rate_non_elderly = (fraud_non_elderly / n_non_elderly) * 100
print(f"Fraud rate (Elderly 70+): {fraud_rate_elderly:.3f}%")
print(f"Fraud rate (Non-Elderly <70): {fraud_rate_non_elderly:.3f}%")
Z-statistic: 13.324
P-value: 0.00000
Fraud rate (Elderly 70+): 0.832%
Fraud rate (Non-Elderly <70): 0.548%
The Z-test returned a z = 13.32 and p < 0.000001, confirming that the elderly group exhibits a statistically higher fraud rate (0.83%) compared to younger users (0.55%). This indicates that the apparent increase is not random noise, but a systematic pattern within the dataset, potentially reflecting realistic demographic vulnerability or an intentional behavior embedded in the data generator
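As a sanity check on the methodology, the statsmodels result can be reproduced with the textbook pooled-proportion formula; the counts below are hypothetical, chosen only for illustration, not the dataset's actual numbers:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts for illustration only (not the dataset's actual numbers)
successes = np.array([120, 800])     # fraud cases: elderly group, younger group
trials = np.array([15000, 150000])   # transactions per group

z_lib, p_lib = proportions_ztest(successes, trials, alternative='larger')

# Manual pooled two-proportion z-statistic: z = (p1 - p2) / SE_pooled
p1, p2 = successes / trials
p_pool = successes.sum() / trials.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / trials[0] + 1 / trials[1]))
z_manual = (p1 - p2) / se

print(f"library z = {z_lib:.3f}, manual z = {z_manual:.3f}")
```

The two values agree because, with no specified null difference, proportions_ztest uses the same pooled variance estimate under the null hypothesis.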
The derived age variable shows excellent structural quality and meaningful behavioral variation. Age influences both spending patterns and fraud exposure, with the middle-aged group (30-45) showing the highest spending intensity and elderly users (70+) exhibiting a statistically significant increase in fraud rate. While the dataset is synthetic, the relationship between age and fraud aligns with plausible real-world dynamics, suggesting that the data generator encoded age-dependent spending and risk behavior realistically. Therefore, age can be confidently retained as a valuable predictive feature, both for behavioral segmentation and for improving the model's ability to detect fraud across demographic groups
😵 is_fraud¶
The is_fraud feature is the target label indicating whether a transaction is fraudulent (1) or legitimate (0)
fraud_counts = df_train['is_fraud'].value_counts()
plt.figure(figsize=(4, 4))
plt.pie(fraud_counts, labels=['Not Fraud (0)', 'Fraud (1)'], autopct='%1.2f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title("Distribution of Fraudulent vs. Non-Fraudulent Transactions")
plt.show()
Graph 21 - Distribution of Fraudulent vs. Non-Fraudulent Transactions
The dataset shows a strong class imbalance: nearly all transactions are non-fraudulent, while only a tiny fraction (≈0.5%) represent actual fraud
This mirrors real-world financial datasets, where fraudulent transactions are rare but high-impact events
Such imbalance poses a serious modeling challenge:
Models trained on raw data may default to predicting non-fraud to achieve high accuracy but low recall
Consequently, they may fail to detect rare frauds, which are the most critical to identify
To address this, during the modeling phase, we will consider:
Resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique)
Cost-sensitive learning, by assigning higher class weights to fraud cases
Evaluation metrics beyond accuracy, such as Precision, Recall, F1-Score and ROC-AUC, to ensure fair assessment under imbalance
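The cost-sensitive option above can be sketched with scikit-learn's class_weight machinery; the toy data and plain logistic regression below are illustrative stand-ins for the real pipeline, not the modeling code itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels (~0.5% positives, mirroring the dataset's fraud rate)
rng = np.random.default_rng(42)
y_toy = (rng.random(10_000) < 0.005).astype(int)
X_toy = rng.normal(size=(10_000, 3)) + y_toy[:, None]  # shift positives so they are learnable

# 'balanced' reweights classes inversely to their frequency
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))

# The same idea applied directly inside the estimator
clf = LogisticRegression(class_weight="balanced").fit(X_toy, y_toy)
```

With ~0.5% positives, the minority class receives a weight roughly 200x that of the majority class, so each missed fraud case costs the model far more during training.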
Dropping helper features
During the EDA process, we created helper columns to build graphs and visualizations and to conduct statistical tests for data integrity. These columns do not contribute to the training process and are redundant. However, since some features will still be used for feature engineering in later sections, we keep a raw copy of the dataset before dropping them:
df_train.columns
Index(['cc_num', 'merchant', 'category', 'amt', 'gender', 'street', 'city',
'state', 'lat', 'long', 'city_pop', 'job', 'merch_lat', 'merch_long',
'is_fraud', 'hour', 'day_of_week', 'month', 'year', 'date', 'suffix',
'city_norm', 'distance_cardholder_merchant', 'distance_group', 'age'],
dtype='object')
drop_cols = [
'suffix',
'lat',
'long',
'merch_lat',
'merch_long',
'city_norm',
'street', # Was used for the purpose of EDA, but is redundant for training
    'year',        # used for the purpose of EDA, but has only 2 distinct values, not informative enough for training
'distance_group'
]
df_train.drop(columns=[col for col in drop_cols if col in df_train.columns], inplace=True)
df_test.drop(columns=[col for col in drop_cols if col in df_test.columns], inplace=True)
# Confirm the structure
print("Dropped helper columns. Remaining features:")
print(df_train.columns)
Dropped helper columns. Remaining features:
Index(['cc_num', 'merchant', 'category', 'amt', 'gender', 'city', 'state',
'city_pop', 'job', 'is_fraud', 'hour', 'day_of_week', 'month', 'date',
'distance_cardholder_merchant', 'age'],
dtype='object')
# Copies for Feature Engineering
df_train_raw = df_train.copy()
df_test_raw = df_test.copy()
print("created df_train_raw and df_test_raw")
# For unsupervised learning, remove 'date' and any columns unsuitable for distance-based algorithms
df_train = df_train.drop(columns=['date'])
df_test = df_test.drop(columns=['date'])
created df_train_raw and df_test_raw
Unsupervised Learning: PCA, t-SNE, and Clustering¶
Beyond predictive modeling, unsupervised learning allows us to explore the dataset's intrinsic structure without relying on the target label.
By projecting high-dimensional transactions into lower dimensions, we can visualize how naturally the data forms clusters, and whether those clusters correspond to fraudulent behavior.
We focus on two complementary techniques:
| Method | Purpose | Characteristics |
|---|---|---|
| PCA | Linear dimensionality reduction | Captures global variance; fast and interpretable |
| t-SNE | Non-linear manifold learning | Preserves local neighborhoods; excellent for revealing small clusters |
We will also apply K-Means clustering to the transformed data to detect hidden groupings and assess their correspondence with the known fraud labels.
Encoding Strategy for Unsupervised Learning
Unsupervised methods are sensitive to how we turn raw columns into numbers. Distances and variances come from the encodings, so we want compact, leakage-free representations that don't explode dimensionality.
Encoding types that are safe for our usage:
One-Hot Encoding: will be used for low-cardinality categoricals (like gender and category) because of its interpretability
Frequency Encoding: mapping each category to its relative frequency in the dataset (for features like merchant, job, city, state and cc_num). The pro here is that it is stable and preserves the global distribution signal
Cyclical encoding for time: hour, day_of_week and month will be turned into sin/cos pairs (which respect periodicity and work well with PCA/K-Means)
For the supervised models later on, we will introduce fraud-rate encoding; we will not use it here, since it relies on the target label
Frequency Encoding:
class FrequencyEncoder(BaseEstimator, TransformerMixin):
"""
Encodes categorical features by their frequency (normalized counts),
replacing the original categorical columns
"""
def __init__(self, min_freq=0, normalize=True):
self.min_freq = min_freq
self.normalize = normalize
self.freq_maps_ = {}
def fit(self, X, y=None):
X = pd.DataFrame(X).copy()
for col in X.columns:
counts = X[col].value_counts(normalize=self.normalize)
if self.min_freq > 0:
threshold = self.min_freq if self.normalize else int(self.min_freq)
counts = counts[counts >= threshold]
self.freq_maps_[col] = counts.to_dict()
return self
def transform(self, X):
X = pd.DataFrame(X).copy()
for col in X.columns:
mapping = self.freq_maps_.get(col, {})
X[col] = X[col].map(mapping).fillna(0)
return X.values
Cyclical Encoding:
class CyclicalTimeEncoder(BaseEstimator, TransformerMixin):
"""
Encodes periodic time features (like hour, day_of_week, month)
into sine and cosine components to preserve their cyclical nature.
Automatically handles both numeric and string inputs.
"""
def __init__(self, period_map=None):
self.period_map = period_map or {}
self.feature_names_out_ = []
# Defined mappings for text-based time features
self.day_map = {
'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
self.month_map = {
'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5,
'June': 6, 'July': 7, 'August': 8, 'September': 9,
'October': 10, 'November': 11, 'December': 12
}
def fit(self, X, y=None):
self.columns_ = X.columns.tolist()
return self
def transform(self, X):
X = pd.DataFrame(X).copy()
result = pd.DataFrame(index=X.index)
self.feature_names_out_ = []
for col in X.columns:
period = self.period_map.get(col, None)
if period is None:
raise ValueError(f"No period specified for column '{col}'")
# Handle string-based days or months automatically
if X[col].dtype == 'object':
if col == 'day_of_week':
X[col] = X[col].map(self.day_map)
elif col == 'month':
X[col] = X[col].map(self.month_map)
X[col] = pd.to_numeric(X[col], errors='coerce')
# Apply sin and cos transformation
result[f"{col}_sin"] = np.sin(2 * np.pi * X[col] / period)
result[f"{col}_cos"] = np.cos(2 * np.pi * X[col] / period)
self.feature_names_out_.extend([f"{col}_sin", f"{col}_cos"])
return result.values
def get_feature_names_out(self, input_features=None):
return np.array(self.feature_names_out_, dtype=object)
Small explanation:
We define the cyclical columns and their periods (e.g., 24 hours in a day, 7 days in a week, 12 months in a year)
The transformer computes sin(2πx / period) and cos(2πx / period) for each column
It outputs the results as new numeric features, useful for PCA, clustering, etc.
Data Preparation:
# Separate features (X) and target label (y)
X = df_train.drop(columns=['is_fraud'])
y = df_train['is_fraud'].astype(int)
# Define Column Groups
card_col = ['cc_num']
high_card_cols = ["merchant", "job", "city", "state"]
low_card_cols = ["gender", "category"]
time_cols = ["hour", "day_of_week", "month"]
exclude_cols = ["is_fraud"] + high_card_cols + low_card_cols + time_cols
num_cols = (
df_train.select_dtypes(include=["int64", "float64", "int32", "float32"])
.columns.difference(exclude_cols)
.tolist()
)
print("Numeric Columns:", num_cols)
Numeric Columns: ['age', 'amt', 'cc_num', 'city_pop', 'distance_cardholder_merchant']
# preprocessing transformer
preprocess_unsupervised = ColumnTransformer(
transformers=[
# Frequency encoding for high-cardinality categorical features
("freq_high", FrequencyEncoder(), high_card_cols),
# Frequency encoding for card number (activity-based encoding)
("freq_card", FrequencyEncoder(), card_col),
# One-hot encoding for low-cardinality categorical features
("onehot_low", OneHotEncoder(handle_unknown="ignore", sparse_output=False), low_card_cols),
# Cyclical time encoding (hour, day_of_week, month)
("cyclical_time", CyclicalTimeEncoder(period_map={
'hour': 24,
'day_of_week': 7,
'month': 12
}), time_cols),
# Min-Max scaling for continuous numeric features
("scaler", MinMaxScaler(), num_cols),
],
remainder="drop",
verbose_feature_names_out=False,
)
Applying The Transformer:
X_prepared = preprocess_unsupervised.fit_transform(X)
print("Final transformed shape:", X_prepared.shape)
Final transformed shape: (1296675, 32)
〽 PCA¶
Principal Component Analysis (PCA) is applied to uncover the dominant directions of variance in the dataset and to evaluate how many components are sufficient to represent most of its information.
By projecting the data into orthogonal axes that capture maximal variance, we can identify the intrinsic dimensionality of the transaction space and prepare for lower-dimensional visualization or clustering:
# Initialize PCA
pca = PCA(n_components=None, random_state=42)
X_pca_full = pca.fit_transform(X_prepared)
# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Plot
plt.figure(figsize=(8,5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.title("Explained Variance by Principal Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid(True)
plt.tight_layout()
plt.show()
# Show first few component contributions
for i, var in enumerate(explained_variance_ratio[:5], 1):
print(f"Component {i}: {var:.4f} variance explained")
print(f"\nTotal variance explained by first 2 components: {cumulative_variance[1]:.2%}")
Component 1: 0.1193 variance explained
Component 2: 0.1171 variance explained
Component 3: 0.1121 variance explained
Component 4: 0.1111 variance explained
Component 5: 0.1081 variance explained

Total variance explained by first 2 components: 23.65%
Graph 22 - Explained Variance by Principal Components
The first two components explain approximately 23.6% of the total variance, meaning that a 2D projection captures about one-quarter of the overall data structure: sufficient for visual inspection but not for full reconstruction.
The cumulative variance rises sharply across the first few components, reaching about 55-60% by the 5th component, and flattens around 7-10 components, where most of the meaningful variance has already been captured. Beyond roughly 20 components, the gain in explained variance becomes negligible, indicating that the majority of variability in the encoded dataset can be effectively represented in a 10-20 dimensional subspace instead of the full 32 dimensions.
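One practical way to act on this observation: scikit-learn's PCA accepts a float n_components, keeping just enough components to reach a target cumulative variance. A sketch on synthetic stand-in data (not the notebook's X_prepared):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 32 features whose variance lives mostly in a ~10-dim subspace
base = rng.normal(size=(2000, 10))
extra = base @ rng.normal(size=(10, 22)) * 0.05 + rng.normal(size=(2000, 22)) * 0.01
X_demo = np.hstack([base, extra])

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca_90 = PCA(n_components=0.90, random_state=42)
pca_90.fit(X_demo)
print(f"components needed for 90% variance: {pca_90.n_components_}")
```

Applied to X_prepared, the same call would select a component count in line with the elbow seen in the explained-variance plot.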
# Reduce to 2D
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_prepared)
# plot
plt.figure(figsize=(8,6))
plt.scatter(
X_pca_2d[:, 0],
X_pca_2d[:, 1],
c=y, cmap='coolwarm', s=2, alpha=0.6
)
plt.title("PCA Projection (2 Components) - Fraud vs. Non Fraud")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label='is_fraud (0 = Non-Fraud, 1 = Fraud)')
plt.tight_layout()
plt.show()
Graph 23 - PCA projection
The 2D PCA projection displays transactions according to their principal component coordinates, where each point represents a single transaction:
- Blue points correspond to legitimate transactions (is_fraud = 0)
- Red points represent fraudulent transactions (is_fraud = 1)
Each axis (Principal Component 1 and 2) captures the directions of highest variance in the dataset after encoding and scaling - essentially, the two most informative linear combinations of all numerical, categorical and cyclical features.
Based on the visualization above, we can conclude the following:
The overall pattern forms distinct horizontal bands or stripes, which arise due to dominant structured variables such as:
- Cyclical time features (hour, day_of_week, month) that repeat periodically
- Categorical encodings (category, state, job) that introduce discrete variance steps
This structured appearance means that the data is highly organized and not random. Transactions exhibit repetitive behavioral patterns (e.g., daily purchasing cycles, consistent merchant categories, or recurring spending behavior).
The fraudulent transactions (red points) are sparse and dispersed throughout the legitimate clusters. They do not form any clear or isolated cluster, instead, they blend into the dense blue regions. This indicates that:
Fraudulent activity does not create a distinct high-variance direction that PCA can easily separate
Fraud behavior is embedded within legitimate transaction space, mimicking normal user patterns
This supports many of the observations made during feature exploration, where fraudulent transactions were intentionally designed to look legitimate.
We know that the first two components capture only 23.6% of the total variance in the dataset. While this is enough to provide a broad visualization of transaction patterns, it does not represent the full complexity of the data. Therefore, this 2D projection should be seen as a compressed illustration, not as a complete separation of behavioral dynamics.
💡 Note:
The fact that fraudulent and non-fraudulent transactions overlap heavily in this projection suggests that fraud cannot be linearly separated in the feature space. This highlights the need for non-linear techniques such as t-SNE.
➿ t-SNE¶
While PCA captures global, linear variance, it may miss subtle local relationships hidden within the high-dimensional feature space. To uncover these non-linear patterns, we apply t-SNE (t-Distributed Stochastic Neighbor Embedding), a non-linear manifold learning technique designed to preserve local neighborhoods - meaning that points which are close in high-dimensional space remain close in the 2D embedding.
This makes t-SNE particularly effective for visualizing structure, subtle clusters, and outliers that are often missed by PCA, especially when fraudulent transactions represent small, context-specific anomalies hidden within legitimate activity.
# Because t-SNE is computationally expensive, we take a sample
"""
NOTE: Sampling does not distort overall structure because the dataset is large
and well-distributed. 10,000 points are sufficient to approximate global behavior
while keeping runtime manageable
"""
sample_size = 10000 # can be adjusted if we need finer detail
X_sample = X_prepared[:sample_size]
y_sample = y[:sample_size]
# Initialize t-SNE
start_time = time.time()
tsne = TSNE(
n_components=2,
perplexity=70, # balances local/global structure
learning_rate=300, # moderate, prevents local "worming"
max_iter=1500, # allows convergence
init='pca', # smoother start, preserves global layout
random_state=42,
verbose=1
)
X_tsne = tsne.fit_transform(X_sample)
print(f"t-SNE completed in {time.time() - start_time:.2f} seconds")
[t-SNE] Computing 211 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.001s...
[t-SNE] Computed neighbors for 10000 samples in 2.259s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.671876
[t-SNE] KL divergence after 250 iterations with early exaggeration: 75.412872
[t-SNE] KL divergence after 1500 iterations: 0.725678
t-SNE completed in 182.67 seconds
# Visualization
plt.figure(figsize=(8,6))
plt.scatter(
X_tsne[:, 0],
X_tsne[:, 1],
c=y_sample,
cmap='coolwarm',
s=5,
alpha=0.6
)
plt.title("t-SNE Projection (2 Components) - Fraud vs. Non-Fraud")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.colorbar(label="is_fraud (0 = Non-Fraud, 1 = Fraud)")
plt.tight_layout()
plt.show()
Graph 24 - t-SNE projection
The visualization above shows the t-SNE projection of 10,000 randomly sampled transactions:
- Blue points remain the legitimate transactions (is_fraud = 0)
- Red points remain the fraudulent transactions (is_fraud = 1)
Each point corresponds to a transaction embedded in a 2D space that preserves local similarity from the original 32-dimensional feature space.
The resulting map forms a series of compact, rounded clusters, each representing transactions that share similar behavioral or contextual properties. For example, purchases from similar merchant types, time patterns, or geographical regions.
This clustered but continuous structure indicates that transaction behaviors are highly organized, reflecting consistent real-world patterns such as daily routines or repeated merchant interactions.
Fraudulent transactions are scattered within these clusters, showing no isolated or unique grouping. Instead, they are interspersed among legitimate data points, implying the same conclusion we reached in the PCA analysis - fraudulent behavior closely mimics normal transactional patterns, at least within certain contexts.
Therefore, while t-SNE helps confirm that the dataset exhibits strong natural structure, it also highlights the subtle and embedded nature of fraud, justifying the need for supervised and non-linear models to effectively detect such hidden anomalies.
⭕ K-Means¶
K-Means is a simple yet powerful algorithm for discovering latent groupings within data. While PCA and t-SNE focus on visualization, K-Means explicitly partitions the dataset into k clusters, minimizing within-cluster variance.
We want to identify potential behavioral clusters that capture recurring transaction patterns, and to examine whether fraudulent transactions concentrate in any single cluster or are spread throughout - which helps us understand the nature of fraudulent behavior.
# We'll work on a sample (Since K-Means scales poorly with millions of points)
sample_size = 20000
X_sample = X_prepared[:sample_size]
y_sample = y[:sample_size]
# determine optimal number of clusters using elbow method
inertias = []
K_range = range(2,21)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_sample)
inertias.append(kmeans.inertia_)
plt.figure(figsize=(7,4))
plt.plot(K_range, inertias, marker='o')
plt.title("Elbow Method - Optimal Number of Clusters (K)")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.grid(True)
plt.tight_layout()
plt.show()
Graph 25 - Elbow Method (K-Means)
The extended elbow analysis up to K=20 shows a smooth, monotonically decreasing inertia curve, with no abrupt "elbow" point.
However, the rate of improvement flattens notably beyond K ≈ 10-12, suggesting that most of the structural variance in the data is captured by this range. Increasing K beyond 12 yields only marginal gains, indicating diminishing returns and potential over-segmentation.
Therefore, we consider K = 10 as a practical trade-off between cluster compactness and interpretability.
kmeans_final = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_sample)
# Evaluate
silhouette_avg = silhouette_score(X_sample, cluster_labels)
print(f"Average Silhouette Score: {silhouette_avg:.3f}")
# Add cluster assignments to the data
df_clusters = pd.DataFrame({
'cluster': cluster_labels,
'is_fraud': y_sample
})
Average Silhouette Score: 0.137
The K-Means model achieved an average silhouette score of 0.137, indicating weakly separated clusters with substantial overlap.
This low score suggests that while some latent structure exists in the transaction space, the boundaries between clusters are not well-defined - as expected in financial data, where fraudulent behavior is intentionally blended within legitimate activity patterns.
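For context on how to read a score of 0.137, it helps to see what silhouette values look like at the two extremes. A quick sketch on synthetic blobs (a hypothetical reference, not part of the transaction data): tight, well-separated clusters score high, while heavily overlapping ones score low.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two reference cases: well-separated blobs vs. heavily overlapping ones
X_tight, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.5, random_state=42)
X_loose, _ = make_blobs(n_samples=1000, centers=4, cluster_std=6.0, random_state=42)

scores = {}
for name, X in [("separated", X_tight), ("overlapping", X_loose)]:
    labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
    scores[name] = silhouette_score(X, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

Our 0.137 sits near the "overlapping" end of this scale, consistent with fraud blending into legitimate activity.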
Fraud Distribution per Cluster:
fraud_ratio = df_clusters.groupby('cluster')['is_fraud'].mean().sort_values(ascending=False)
fraud_counts = df_clusters.groupby('cluster')['is_fraud'].sum()
total_counts = df_clusters['cluster'].value_counts().sort_index()
fraud_summary = pd.DataFrame({
'Total Transactions': total_counts,
'Fraudulent Transactions': fraud_counts,
'Fraud Ratio (%)': (fraud_ratio * 100).round(3)
})
fraud_summary['Fraud Ratio (%)'].plot(kind='bar', figsize=(8,4), color='tomato', alpha=0.7)
plt.title("Fraud Ratio (%) Across K-Means Clusters")
plt.xlabel("Cluster")
plt.ylabel("Fraud Ratio (%)")
plt.grid(True, axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
Graph 26 - Fraud ratio across K-Means Clusters
The chart above shows that fraudulent transactions are unevenly distributed across the 10 clusters. Clusters 0 and 1 exhibit notably higher fraud ratios (around 1% - 1.75%), while most others remain below 0.5%.
This pattern suggests that certain behavioral groups carry elevated fraud risk, yet no cluster is dominated by fraud, reinforcing what we've seen in the PCA and t-SNE sections.
Conclusion¶
Unsupervised exploration revealed no clear separation between fraudulent and legitimate activity. PCA and t-SNE showed that fraud cases are deeply embedded within normal patterns, while K-Means clustering confirmed only weak separability. These findings highlight that fraud detection in this dataset requires supervised, non-linear modeling capable of capturing the subtle and context-dependent signals hidden in legitimate behavior.
EDA Conclusion¶
The dataset is clean, diverse, and behaviorally rich, providing a strong foundation for fraud detection modeling. Although fraud cases are rare, they exhibit distinct temporal, monetary, and categorical patterns, particularly across time-of-day, transaction amount, and merchant category features.
Key predictive drivers include amt, temporal features like hour, day_of_week, month, and merchant context, while demographic and geographic variables add complementary insights into spending diversity and fraud exposure.
Features such as gender and job require cautious use to avoid bias or overfitting due to sparsity, and redundant features have been safely removed.
Overall, the dataset demonstrates excellent structure and realistic behavioral consistency. We are ready to move on to the next section, where we explore additional features that we can engineer to further improve the training of different supervised models.
Feature Engineering¶
Feature engineering is the process of creating, transforming, or selecting variables in order to improve the predictive power of machine learning models.
In our analysis, we've already identified which raw fields are useful and how they can be transformed into meaningful signals for fraud detection.
However, we can still apply feature engineering to create even better features that enrich the data and make it more useful for model training.
Let us first look at the linear correlation between the features in the dataset. This will help us understand which features are more reliable for our task and which are redundant.
# Keep only numeric columns for correlation
corr = df_train.corr(numeric_only=True)
plt.figure(figsize=(10, 6))
plt.imshow(corr, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label="Correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Heatmap of Numeric Features")
plt.tight_layout()
plt.show()
Graph 27 - Correlation Heatmap
The correlation matrix shows that most numeric features in the dataset are weakly correlated with each other, meaning they provide largely independent information, which is good for machine learning models.
Let us now engineer a few more features that will add extra robustness to the dataset:
📓 card_had_prev_fraud¶
This is a new feature that indicates whether the credit card involved in the current transaction has ever been used in a fraudulent transaction before (based on past data).
For each card, transactions are sorted chronologically, and the feature is set to True only if the same card had a confirmed fraud prior to this transaction - never using any future information. This ensures the feature is completely time-safe and free from data leakage.
We include this feature because cards with a known fraud history are much more likely to be involved in fraud again, making this a strong behavioral signal for the model. It helps the algorithm capture repeat-offender patterns that are not easily reflected by other transactional or demographic attributes.
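The time-safety hinges on the shift().cummax() combination: shift() looks only at strictly earlier rows, and cummax() latches the flag on once a fraud has occurred. A hypothetical single-card history makes this explicit:

```python
import pandas as pd

# One card's transactions in chronological order; frauds at positions 2 and 4
is_fraud = pd.Series([0, 0, 1, 0, 1, 0])

# shift() excludes the current row, cummax() keeps the flag raised afterwards
prev_fraud = is_fraud.shift().cummax().fillna(0).astype(bool)
print(prev_fraud.tolist())  # [False, False, False, True, True, True]
```

Note that the fraud at index 2 does not flag its own row - only the rows that come after it - which is exactly the leakage-free behavior we want.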
df_train = df_train_raw.sort_values(['cc_num', 'date']).copy()
df_train['card_had_prev_fraud'] = (
df_train.groupby('cc_num')['is_fraud']
.transform(lambda x: x.shift().cummax().fillna(0))
.astype(bool)
)
fraud_cards_train = set(df_train.loc[df_train['is_fraud'] == 1, 'cc_num'])
df_test['card_had_prev_fraud'] = df_test['cc_num'].isin(fraud_cards_train)
In order to verify that the new feature was created correctly, we will add a sanity check:
For every transaction where card_had_prev_fraud == True, we verify that there is indeed a prior fraudulent transaction for the same card. Similarly, we confirm that no card is marked as having no prior frauds when in fact it does.
This step helps guarantee that the feature was implemented correctly and that no data leakage or logical inconsistencies were introduced during feature creation.
# Ensure chronological order per card
df_train = df_train.sort_values(['cc_num', 'date']).copy()
# Initialize counters
errors_flagged = 0
# Loop through each card and check consistency
for card, group in df_train.groupby('cc_num'):
# Compute the true "previous fraud" flag from scratch
true_prev_fraud = group['is_fraud'].shift().cummax().fillna(0).astype(bool)
# Compare to our feature
if not (true_prev_fraud == group['card_had_prev_fraud']).all():
errors_flagged += 1
print(f"Inconsistency found for card: {card}")
display(pd.concat([group[['date', 'is_fraud', 'card_had_prev_fraud']],
true_prev_fraud.rename('true_prev_fraud')], axis=1).head(10))
if errors_flagged == 0:
print("All cards consistent: every 'card_had_prev_fraud' flag is correct.")
else:
print(f"{errors_flagged} cards had mismatched flags — investigate above.")
All cards consistent: every 'card_had_prev_fraud' flag is correct.
We can see that all cards are consistent, which is a great sign. Let us now evaluate how useful the feature really is:
Feature Observation
fraud_rate_by_flag = df_train.groupby('card_had_prev_fraud')['is_fraud'].mean().reset_index()
fraud_rate_by_flag['is_fraud'] *= 100 # Convert to %
fraud_rate_by_flag.rename(columns={'is_fraud' : 'Fraud Rate (%)'}, inplace=True)
plt.figure(figsize=(5,4))
plt.bar(
fraud_rate_by_flag['card_had_prev_fraud'].astype(str),
fraud_rate_by_flag['Fraud Rate (%)'],
color=['#4C72B0', '#C44E52']
)
plt.title("Fraud Rate by Card's Fraud History")
plt.ylabel("Fraud Rate (%)")
plt.xlabel("Card Had Previous Fraud")
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Graph 28 - Fraud Rate by Card's Fraud History
The finding is very encouraging. Despite the dataset's extreme class imbalance, the card_had_prev_fraud feature shows a massive behavioral separation: cards with prior fraud history exhibit a fraud rate of ~1.3%, nearly ten times higher than normal cards. This is an extraordinary signal given how rare fraud is overall. In practical terms, the model can instantly identify a subset of transactions where the baseline fraud probability skyrockets - a "high-alert" flag that mimics real-world risk scoring systems used by banks and big corporations.
🌡 card_prev_fraud_ratio¶
While card_had_prev_fraud provides a binary signal indicating whether a card has ever been involved in fraud, it does not capture how frequently fraudulent behavior has occurred relative to the card's overall activity.
The feature card_prev_fraud_ratio addresses this limitation by representing the proportion of previous fraudulent transactions out of all past transactions for a given card.
This ratio gives the model a more nuanced, continuous measure of risk: cards with one fraud in 100 transactions behave differently from those with one in three.
df_train = df_train.sort_values(['cc_num', 'date']).copy()
# Count prior frauds and transactions per card
df_train['prev_fraud_count'] = (
    df_train.groupby('cc_num')['is_fraud']
    .transform(lambda s: s.cumsum().shift())  # shift within each card's group, so counts never leak across cards
    .fillna(0)
)
df_train['prev_txn_count'] = df_train.groupby('cc_num').cumcount()
# Ratio of prior frauds
df_train['card_prev_fraud_ratio'] = df_train['prev_fraud_count'] / df_train['prev_txn_count'].replace(0, 1)
# Compute fraud ratio per card from training data only
card_stats = (
df_train.groupby('cc_num')['is_fraud']
.agg(['sum', 'count'])
.rename(columns={'sum': 'train_fraud_count', 'count': 'train_total_count'})
)
card_stats['card_prev_fraud_ratio'] = card_stats['train_fraud_count'] / card_stats['train_total_count']
# Merge into test with the same column name
df_test = df_test.merge(card_stats[['card_prev_fraud_ratio']], on='cc_num', how='left')
df_test['card_prev_fraud_ratio'] = df_test['card_prev_fraud_ratio'].fillna(0)
# Drop helper columns and the raw date from train
df_train.drop(columns=['prev_txn_count', 'prev_fraud_count', 'date'], inplace=True)
Feature Observation
plt.figure(figsize=(8, 5))
sns.boxplot(
data=df_train,
x='is_fraud',
y='card_prev_fraud_ratio',
hue='is_fraud',
palette=['#4CAF50', '#E53935'],
legend=False
)
plt.yscale('log')
plt.title("Distribution of card_prev_fraud_ratio by Fraud Label (Log Scale)", fontsize=14)
plt.xlabel("Fraud Label (0 = Legit, 1 = Fraud)", fontsize=12)
plt.ylabel("Card Previous Fraud Ratio (log scale)", fontsize=12)
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()
Graph 29 - Distribution of card_prev_fraud_ratio
The boxplot shows that transactions labeled as fraud (1) tend to have noticeably higher previous-fraud ratios than legitimate ones (0), even after applying a logarithmic scale.
This pattern suggests that cards with a history of fraudulent behavior are more likely to be used in new fraudulent transactions. Therefore, card_prev_fraud_ratio is a meaningful and predictive feature that provides a strong behavioral signal.
⏰ Temporal Flag Features¶
To capture behavioral patterns tied to time, we created a utility function:
def add_time_flags(df):
if 'hour' not in df.columns or 'day_of_week' not in df.columns:
raise KeyError("DataFrame must contain 'hour' and 'day_of_week' columns")
# Convert to numeric
df['hour'] = pd.to_numeric(df['hour'], errors='coerce').fillna(-1)
# Handle 'day_of_week' as text or numeric
if df['day_of_week'].dtype == 'object':
# Map weekday names to numbers
day_map = {
'Monday': 0, 'Tuesday' : 1, 'Wednesday' : 2,
'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5,
'Sunday' : 6
}
df['day_of_week_num'] = df['day_of_week'].map(day_map)
else:
df['day_of_week_num'] = pd.to_numeric(df['day_of_week'], errors='coerce')
# Create flags
df['is_night'] = np.where((df['hour'] >= 22) | (df['hour'] < 6), 1, 0)
df['is_weekend'] = np.where(df['day_of_week_num'] >= 5, 1, 0)
df.drop(columns=['day_of_week_num'], inplace=True)
return df
df_train = add_time_flags(df_train)
df_test = add_time_flags(df_test)
This function adds two binary flag features:
- is_night - night-time activity may indicate higher fraud risk (as seen in the EDA)
- is_weekend - weekend spending patterns often differ from weekday activity, which might help the model detect unusual transaction timing
By incorporating these temporal flags, the model can learn contextual cues about transaction timing, which frequently improves fraud detection performance.
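The thresholds inside add_time_flags can be sanity-checked on a small hypothetical frame (the flag logic is reproduced inline here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame covering the boundary cases: 23:00, 03:00, noon, 06:00
toy = pd.DataFrame({
    "hour": [23, 3, 12, 6],
    "day_of_week": ["Saturday", "Monday", "Sunday", "Friday"],
})
day_map = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
           "Friday": 4, "Saturday": 5, "Sunday": 6}

# Night spans 22:00-05:59; weekend is Saturday/Sunday
toy["is_night"] = np.where((toy["hour"] >= 22) | (toy["hour"] < 6), 1, 0)
toy["is_weekend"] = np.where(toy["day_of_week"].map(day_map) >= 5, 1, 0)
print(toy)
```

Note that 06:00 is classified as day, since the night window is half-open: hour >= 22 or hour < 6.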
Features Observation
# Compute fraud rate by each flag
night_fraud_rate = df_train.groupby('is_night')['is_fraud'].mean().reset_index()
weekend_fraud_rate = df_train.groupby('is_weekend')['is_fraud'].mean().reset_index()
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Fraud Rate by Night vs. Day
sns.barplot(
data=night_fraud_rate,
x='is_night', y='is_fraud',
hue='is_night',
palette={0: '#4CAF50', 1: '#E53935'},
legend=False,
ax=axes[0]
)
axes[0].set_title("Fraud Rate by Night vs. Day", fontsize=14)
axes[0].set_xlabel("Is Night", fontsize=12)
axes[0].set_ylabel("Fraud Rate (%)", fontsize=12)
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(["Day", "Night"])
axes[0].grid(True, linestyle="--", alpha=0.4)
# Annotate fraud rates on top of bars
for i, row in night_fraud_rate.iterrows():
axes[0].text(
i,
row['is_fraud'] + 0.0002,
f"{row['is_fraud']:.2%}",
ha='center',
va='bottom',
fontsize=10
)
# Fraud Rate by Weekend vs. Weekday
sns.barplot(
data=weekend_fraud_rate,
x='is_weekend', y='is_fraud',
hue='is_weekend',
palette={0: '#4CAF50', 1: '#E53935'},
legend=False,
ax=axes[1]
)
axes[1].set_title("Fraud Rate by Weekend vs. Weekday", fontsize=14)
axes[1].set_xlabel("Is Weekend", fontsize=12)
axes[1].set_ylabel("Fraud Rate (%)", fontsize=12)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(["Weekday", "Weekend"])
axes[1].grid(True, linestyle="--", alpha=0.4)
# Annotate fraud rates on top of bars
for i, row in weekend_fraud_rate.iterrows():
axes[1].text(
i,
row['is_fraud'] + 0.0002,
f"{row['is_fraud']:.2%}",
ha='center',
va='bottom',
fontsize=10
)
# Make y-axis show percentages instead of fractions
for ax in axes:
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
plt.tight_layout()
plt.show()
Graph 30 - Fraud Rate Based on New Temporal Features
The temporal analysis shows that is_night is a strong indicator of fraudulent activity, with night transactions being approximately fifteen times more likely to be fraudulent compared to daytime ones.
In contrast, is_weekend shows little distinction between fraud and legitimate transactions, suggesting that the time of day is a far stronger fraud signal than the day of the week. However, it might still matter in interaction terms (for instance, weekend-night transactions could have a specific risk pattern).
Let us now observe the correlation matrix again, to see whether the engineered features relate to the target variable/other features in any way:
# Keep only numeric columns for correlation
corr = df_train.corr(numeric_only=True)
plt.figure(figsize=(10, 6))
plt.imshow(corr, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label="Correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Heatmap of Numeric Features")
plt.tight_layout()
plt.show()
Graph 31 - Updated Correlation Heatmap
The updated correlation matrix confirms that the engineered features contribute new, non-redundant information related to fraud detection, while the dataset remains structurally balanced and free from major multicollinearity issues
Conclusion¶
The feature engineering process enriched the dataset with several informative attributes that capture both temporal and behavioral aspects of transaction patterns.
Features such as card_prev_fraud_ratio and card_had_prev_fraud introduce valuable historical context, while is_night adds a meaningful temporal dimension that helps distinguish fraudulent behavior. At the same time, other variables like is_weekend and location-based measures contribute complementary perspectives, even if their direct correlation with fraud is weaker.
Having validated the importance and stability of these features, we are now ready to proceed to the model training and evaluation phase, where these variables will be leveraged to build predictive fraud detection models
Training Models¶
Preprocessing Stage¶
Before training our supervised models, we design a consistent and leakage-free preprocessing strategy that prepares every column appropriately according to its nature and predictive value.
Categorical features will be divided into three groups, each handled differently:
- Low-cardinality features (category, gender) - just like in unsupervised learning, they will be One-Hot Encoded to preserve interpretability and allow models to learn clear group boundaries
- High-cardinality features (merchant, job, city, state) will be fraud-rate encoded, where each category is replaced with its average fraud rate in the training set. This provides target-aware information (categories more prone to fraud get higher values) while keeping dimensionality low
- Unique or identifier-like features (cc_num) will be frequency-encoded, representing how active each card is without leaking fraud labels. This will help models detect unusual card behavior (e.g., a card suddenly used far more often)
Numeric features (continuous variables such as amt, city_pop, age, ...) will be scaled using MinMaxScaler to bring all values into a uniform [0,1] range. This prevents features with larger numeric ranges (like population or amount) from dominating others during model training.
Temporal features will be encoded using sine and cosine transformations (as seen in the unsupervised learning section). This ensures that the model understands time as a continuous, circular variable.
Flag features (is_night, is_weekend, card_had_prev_fraud) are already binary indicators (0 or 1) and therefore require no further transformation. They will be passed through the pipeline as they are, because their numeric format is already suitable for model training
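The cyclical idea behind the sine/cosine transformation can be illustrated numerically: with a plain linear encoding, 23:00 and 00:00 are 23 units apart, but on the unit circle they become neighbors. A sketch of the underlying math (the actual transformation in our pipeline is handled by CyclicalTimeEncoder):

```python
import numpy as np

def cyc(value, period):
    """Map a cyclic value (e.g., hour of day) onto the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

h23, h0, h12 = cyc(23, 24), cyc(0, 24), cyc(12, 24)
print(f"23:00 vs 00:00 distance: {np.linalg.norm(h23 - h0):.3f}")   # adjacent hours: small
print(f"23:00 vs 12:00 distance: {np.linalg.norm(h23 - h12):.3f}")  # opposite hours: large
```

The same mapping with period 7 and 12 covers day_of_week and month, matching the period_map used below.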
class FraudRateEncoder(BaseEstimator, TransformerMixin):
"""
Encodes categorical features based on their historical fraud rate
"""
def __init__(self, min_samples: int = 1, smoothing: float = 0.0, dtype: str = "float32"):
self.min_samples = min_samples
self.smoothing = smoothing
self.dtype = dtype
self.category_stats_ = None
self.global_rate_ = None
self.feature_name_out_ = None
self._in_name = None
def fit(self, X, y):
# X is a single-column array/dataframe, y is is_fraud (0/1)
x_series = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else pd.Series(X.ravel())
y_series = pd.Series(y).astype(float)
self._in_name = X.columns[0] if isinstance(X, pd.DataFrame) else 'col'
# Group stats
grp = pd.DataFrame({"x": x_series, "y": y_series}).groupby("x")["y"].agg(["mean", "count"])
self.global_rate_ = y_series.mean()
if self.smoothing > 0:
# m-estimate smoothing toward global_rate
smooth_num = grp["mean"] * grp["count"] + self.smoothing * self.global_rate_
smooth_den = grp["count"] + self.smoothing
rate = smooth_num / smooth_den
else:
rate = grp["mean"]
# Enforce min_samples fallback to global
rate = rate.where(grp["count"] >= self.min_samples, self.global_rate_)
self.category_stats_ = rate
self.feature_name_out_ = f"{self._in_name}_fraud_rate"
return self
def transform(self, X):
x_series = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else pd.Series(X.ravel())
out = x_series.map(self.category_stats_).fillna(self.global_rate_).astype(self.dtype)
return out.to_numpy().reshape(-1, 1)
def get_feature_names_out(self, input_features=None):
return np.array([self.feature_name_out_], dtype=object)
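The m-estimate smoothing in fit can be checked by hand. A hypothetical two-category example, reproducing the same arithmetic the encoder performs: category "A" has enough history to earn a (smoothed) rate of its own, while the rare category "B" falls back to the global rate.

```python
import pandas as pd

# Category "A": 2 frauds in 10 txns; "B": 0 frauds in only 2 txns (rare)
x = pd.Series(["A"] * 10 + ["B"] * 2)
y = pd.Series([1, 1] + [0] * 10).astype(float)

global_rate = y.mean()  # 2 / 12
grp = pd.DataFrame({"x": x, "y": y}).groupby("x")["y"].agg(["mean", "count"])

# m-estimate: pull each category's rate toward the global rate by `smoothing` pseudo-counts
smoothing = 5.0
rate = (grp["mean"] * grp["count"] + smoothing * global_rate) / (grp["count"] + smoothing)

# Categories below min_samples (5 here) fall back to the global rate entirely
rate = rate.where(grp["count"] >= 5, global_rate)
print(rate.round(4))
```

With min_samples=100 and smoothing=100 as configured below, the same mechanics simply demand much more evidence before a category's own rate dominates.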
class CardFrequencyEncoder(BaseEstimator, TransformerMixin):
"""
Encodes card numbers (cc_num) by their transaction frequency in the dataset.
This represents how active each card is, without leaking fraud information
Unlike `FraudRateEncoder`, this encoder is label-agnostic and purely structural.
It's ideal for identifier columns like 'cc_num' that uniquely identify entities
"""
def __init__(self, new_col_name: str = "cc_freq"):
self.new_col_name = new_col_name
self.freq_map_ = None
def fit(self, X, y=None):
# Expecting a single column
X = pd.DataFrame(X).copy()
col = X.columns[0]
self.freq_map_ = X[col].value_counts().to_dict()
return self
def transform(self, X):
X = pd.DataFrame(X).copy()
col = X.columns[0]
out = X[col].map(self.freq_map_).fillna(0).astype("int64")
return out.to_numpy().reshape(-1,1)
def get_feature_names_out(self, input_features=None):
return np.array([self.new_col_name], dtype=object)
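The frequency encoding amounts to a value_counts lookup learned on the training data, with a zero fallback for cards never seen at fit time. A hypothetical mini-example of the same mechanics:

```python
import pandas as pd

# Training activity: card 111 used 3 times, 222 once, 333 twice
train_cards = pd.Series([111, 111, 111, 222, 333, 333])
freq_map = train_cards.value_counts().to_dict()

# At transform time, unseen cards (e.g., 999) map to 0 activity
test_cards = pd.Series([111, 333, 999])
encoded = test_cards.map(freq_map).fillna(0).astype("int64")
print(encoded.tolist())  # [3, 2, 0]
```

Because only transaction counts are used, no fraud-label information can leak through this feature.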
low_card_cols = ["category", "gender"] # One-Hot
fraud_rate_cols = ["merchant", "job", "city", "state"] # Fraud Rate Encoding
card_col = ["cc_num"] # Frequency Encoding
num_cols = ["amt", "city_pop", "distance_cardholder_merchant", "age", "card_prev_fraud_ratio"] # Scaled numeric
time_cols = ["hour", "day_of_week", "month"] # Cyclical encoding
preprocess = ColumnTransformer(
transformers=[
# Fraud Rate Encoders for high-cardinality categoricals
("merchant_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["merchant"]),
("job_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["job"]),
("city_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["city"]),
("state_rate", FraudRateEncoder(min_samples=100, smoothing=100), ["state"]),
# Frequency Encoding for card number
("card_freq", CardFrequencyEncoder(new_col_name="cc_freq"), card_col),
# One-Hot Encoding for low-cardinality categoricals
("onehot_low", OneHotEncoder(handle_unknown="ignore", sparse_output=False), low_card_cols),
# Cyclical Encoding for time-based features
("cyclical_time", CyclicalTimeEncoder(period_map={
"hour": 24,
"day_of_week": 7,
"month": 12
}), time_cols),
# Min-Max Scaling for numeric features
("scaler", MinMaxScaler(), num_cols),
# Pass binary flag features through unchanged (remainder="drop" would otherwise discard them)
("flags", "passthrough", ["is_night", "is_weekend", "card_had_prev_fraud"])
],
remainder="drop",
verbose_feature_names_out=False
)
Next, we set a global seed to ensure identical sampling in SMOTE and identical model weight initialization, so every rerun produces the same metrics:
def set_global_seed(seed: int = 42):
"""
Sets random seed for reproducibility across Python, Numpy and PyTorch.
Ensures deterministic behavior for CUDA when available
"""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
set_global_seed(42)
To create clear and consistent visualizations for model comparisons, we will create helper functions:
compare_two_models
The function compares the results of 2 given models.
It focuses on recall, precision, F1 and ROC-AUC. The results given by the function will be analyzed and interpreted in the later stages of the section
def compare_two_models(y_true, preds_1, probs_1, preds_2, probs_2,
model_names=('Model 1', 'Model 2')):
"""
Compares two models by given evaluation metrics,
visualizes results as a bar chart
"""
# Compute metrics
metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']
scores_1 = [
precision_score(y_true, preds_1, zero_division=0),
recall_score(y_true, preds_1, zero_division=0),
f1_score(y_true, preds_1, zero_division=0),
roc_auc_score(y_true, probs_1)
]
scores_2 = [
precision_score(y_true, preds_2, zero_division=0),
recall_score(y_true, preds_2, zero_division=0),
f1_score(y_true, preds_2, zero_division=0),
roc_auc_score(y_true, probs_2)
]
# Display table
results_df = pd.DataFrame({
'Metric': metrics,
model_names[0]: np.round(scores_1, 3),
model_names[1]: np.round(scores_2, 3)
})
print(results_df.to_string(index=False))
print()
# Bar chart
x = np.arange(len(metrics))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 5))
bars1 = ax.bar(x - width/2, scores_1, width, label=model_names[0])
bars2 = ax.bar(x + width/2, scores_2, width, label=model_names[1])
# Annotate bars
for bars in [bars1, bars2]:
for bar in bars:
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=9)
ax.set_ylabel('Score')
ax.set_ylim(0, 1.05)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_title('Model Comparison: Key Metrics')
ax.legend()
ax.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
compare_three_models
The function compares the results of three given models (similarly to the previous helper function).
def compare_three_models(y_true,
preds_1, probs_1,
preds_2, probs_2,
preds_3, probs_3,
model_names=('Model 1', 'Model 2', 'Model 3')):
"""
Compares three models by key evaluation metrics
and visualizes results as a grouped bar chart.
"""
# Define metrics
metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']
# Helper function to compute scores for each model
def compute_scores(preds, probs):
return [
precision_score(y_true, preds, zero_division=0),
recall_score(y_true, preds, zero_division=0),
f1_score(y_true, preds, zero_division=0),
roc_auc_score(y_true, probs)
]
# Compute scores
scores_1 = compute_scores(preds_1, probs_1)
scores_2 = compute_scores(preds_2, probs_2)
scores_3 = compute_scores(preds_3, probs_3)
# Create DataFrame
results_df = pd.DataFrame({
'Metric': metrics,
model_names[0]: np.round(scores_1, 3),
model_names[1]: np.round(scores_2, 3),
model_names[2]: np.round(scores_3, 3)
})
# Display table
print(results_df.to_string(index=False))
print()
# Plot bar chart
x = np.arange(len(metrics))
width = 0.25
fig, ax = plt.subplots(figsize=(9, 5))
bars1 = ax.bar(x - width, scores_1, width, label=model_names[0])
bars2 = ax.bar(x, scores_2, width, label=model_names[1])
bars3 = ax.bar(x + width, scores_3, width, label=model_names[2])
# Annotate bars
for bars in [bars1, bars2, bars3]:
for bar in bars:
ax.text(bar.get_x() + bar.get_width()/2,
bar.get_height() + 0.015,
f'{bar.get_height():.2f}',
ha='center', va='bottom', fontsize=9)
ax.set_ylabel('Score')
ax.set_ylim(0, 1.05)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_title('Model Comparison: Key Metrics')
ax.legend()
ax.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
plot_model_performance
The function visualizes the performance of a trained model, focusing on the same key evaluation metrics as the previous helper functions. It provides a clear overview of model behavior through a detailed confusion matrix and ROC curve, allowing an intuitive assessment of classification accuracy and discriminative power
def plot_model_performance(y_true, y_pred, y_proba, model_name="Model"):
"""
Plots the confusion matrix, ROC curve, and metric summary for a classification model.
"""
# Compute metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_proba)
# Create 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
y_true,
y_pred,
ax=axes[0],
cmap="Blues",
colorbar=False
)
axes[0].set_title(f"Confusion Matrix ({model_name})", fontsize=12)
# ROC curve
RocCurveDisplay.from_predictions(
y_true,
y_proba,
ax=axes[1],
name=model_name
)
axes[1].plot([0, 1], [0, 1], "--", color="gray")
axes[1].set_title("ROC Curve", fontsize=12)
# Metrics Bar Chart
metrics = ["Precision", "Recall", "F1-score", "ROC-AUC"]
values = [precision, recall, f1, roc_auc]
axes[2].bar(metrics, values, color=["#4c72b0", "#55a868", "#c44e52", "#8172b3"])
axes[2].set_ylim(0, 1)
axes[2].set_ylabel("Score")
axes[2].set_title("Performance Metrics", fontsize=12)
# Display exact values on top of bars
for i, v in enumerate(values):
axes[2].text(i, v + 0.02, f"{v:.3f}", ha="center", fontsize=10)
plt.tight_layout()
plt.show()
With the preprocessing pipeline fully established and all features consistently prepared, we are now ready to proceed to the model training phase
Models¶
In this stage, we focus on developing and evaluating a set of supervised machine learning models for fraud detection.
Here is the overview of the selected models:
| Model | How It Works | Why We Use It |
|---|---|---|
| Logistic Regression | A linear model that estimates the probability of fraud using a weighted combination of features and a sigmoid activation function. | Serves as a baseline for performance comparison; simple, interpretable, and fast to train. |
| Random Forest | An ensemble of decision trees built on random feature subsets, combining their outputs through majority voting. | Handles non-linear relationships and feature interactions effectively while being robust to noise and overfitting. |
| XGBoost | A gradient boosting algorithm that builds trees sequentially, each correcting the errors of the previous one. | Known for high predictive accuracy, speed, and built-in regularization; ideal for structured data. |
| Neural Network | A multi-layered model of interconnected neurons that learns complex patterns through non-linear transformations. | Capable of capturing deep, non-linear relationships between features and generalizing across diverse patterns. |
| TabNetClassifier | A deep learning architecture specifically designed for tabular data, using sequential attention to focus on the most relevant features at each decision step. | Combines the strengths of deep learning with interpretable feature selection, often outperforming traditional models on structured data. |
Each model is trained on the preprocessed dataset using the same feature transformations to ensure fairness and consistency across evaluations.
Given the extreme class imbalance in fraudulent transactions, we will apply SMOTE to oversample the minority class and improve the model's ability to recognize rare fraud cases.
Model performance will be assessed primarily using recall and ROC-AUC:
Recall is prioritized to minimize false negatives, as missing a fraudulent transaction carries the highest cost
ROC-AUC provides a comprehensive measure of each model's discriminative capability across various thresholds.
By comparing these models under identical conditions, we aim to identify the one that achieves the best balance between fraud detection sensitivity and overall predictive performance
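On a toy example (values made up for illustration), the two headline metrics behave as follows:

```python
from sklearn.metrics import recall_score, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 0, 0]                    # hard labels
y_proba = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.35]  # fraud probabilities

# Recall: 2 of the 4 actual frauds are caught
print(recall_score(y_true, y_pred))   # 0.5

# ROC-AUC: chance a random fraud is scored above a random non-fraud,
# computed over all predicted probabilities, independent of any threshold
print(roc_auc_score(y_true, y_proba))  # 0.875
```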
🧰 Logistic Regression (Baseline)¶
Training¶
We start by training the baseline model using the original, imbalanced dataset to establish a performance benchmark.
Next, we apply SMOTE to generate synthetic samples for the minority class, thereby balancing the dataset.
Finally, the model is retrained on the resampled data, allowing us to compare results and assess the impact of class balancing on overall predictive performance
# Version without SMOTE
# Define model
lg = LogisticRegression(random_state=42, max_iter=1000)
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
# Build pipeline
steps = [
("preprocess", preprocess),
("lg", lg)
]
pipe = Pipeline(steps)
pipe.fit(X_train, y_train)
# Predict on test set (no SMOTE here — only real test data)
y_pred_lg = pipe.predict(X_test)
y_proba_lg = pipe.predict_proba(X_test)[:, 1]
# Evaluate performance
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lg))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lg))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_lg))
Confusion Matrix:
[[553574 0]
[ 2145 0]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 553574
1 0.00 0.00 0.00 2145
accuracy 1.00 555719
macro avg 0.50 0.50 0.50 555719
weighted avg 0.99 1.00 0.99 555719
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
ROC-AUC Score: 0.8537488669832314
💡 Note on warning:
During evaluation, several UndefinedMetricWarning messages were displayed by scikit-learn. These warnings indicate that the model did not predict any instances of the positive class (fraudulent transactions). In other words, since all predictions were labeled as non-fraud, there were no "positive" predictions to calculate precision or F1-score from, resulting in undefined metrics automatically set to zero.
This behavior is expected and logical in extremely imbalanced datasets, where the model initially learns to favor the dominant class (non-fraud) to minimize overall error. Once class imbalance is addressed, these warnings naturally disappear as the model begins predicting fraud cases correctly
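A minimal reproduction of the situation, together with the `zero_division` parameter the warning suggests:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # every transaction predicted as non-fraud

# No positive predictions -> precision is 0/0; zero_division pins it to 0.0
# instead of emitting UndefinedMetricWarning
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
```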
Now we test the same model with SMOTE.
After carefully testing different SMOTE values, we found that increasing the minority class to 20% produces the best results.
# Version with SMOTE
# Define model
lg = LogisticRegression(random_state=42, max_iter=10000)
# Define SMOTE
smote = SMOTE(
sampling_strategy=0.2, # minority class will be 20% of majority
random_state=42,
k_neighbors=5
)
# Build pipeline using imblearn.Pipeline
steps = [
("preprocess", preprocess),
("smote", smote),
("lg", lg)
]
pipe = Pipeline(steps)
# Fit the pipeline (SMOTE applied only on training since we used imblearn's pipeline)
pipe.fit(X_train, y_train)
# Predict on test set (no SMOTE here - only real test data)
y_pred_lg_smote = pipe.predict(X_test)
y_proba_lg_smote = pipe.predict_proba(X_test)[:, 1]
Results¶
without SMOTE:
plot_model_performance(y_test, y_pred_lg, y_proba_lg, model_name="Baseline Logistic Regression")
/usr/local/lib/python3.12/dist-packages/sklearn/metrics/_classification.py:1565: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
with SMOTE:
plot_model_performance(y_test, y_pred_lg_smote, y_proba_lg_smote, model_name="Logistic Regression (SMOTE)")
compare_two_models(
y_true=y_test,
preds_1=y_pred_lg,
probs_1=y_proba_lg,
preds_2=y_pred_lg_smote,
probs_2=y_proba_lg_smote,
model_names=('Logistic Regression', 'Logistic Regression + SMOTE')
)
| Metric | Logistic Regression | Logistic Regression + SMOTE |
|---|---|---|
| Precision | 0.00 | 0.19 |
| Recall | 0.00 | 0.62 |
| F1-score | 0.00 | 0.29 |
| ROC-AUC | 0.85 | 0.93 |
Model Performance
Before applying SMOTE, the Logistic Regression model completely failed to detect fraudulent transactions.
The precision, recall, and F1-score for the fraud class were all zero, indicating that the model predicted every transaction as legitimate. This outcome highlights the impact of severe class imbalance: the model learned to favor the dominant non-fraud class while entirely ignoring the minority class.
After addressing this imbalance using SMOTE, the model's performance improved. The recall increased to 0.62, meaning the model correctly identified more than half of all fraudulent transactions, while the F1-score rose to 0.29, reflecting a more balanced trade-off between precision and recall.
Furthermore, the ROC-AUC improved from 0.85 to 0.93, confirming that the model's overall discriminative ability between fraudulent and legitimate transactions became stronger.
In summary, balancing the dataset with SMOTE substantially enhanced the model's sensitivity to fraudulent behavior. Although this approach introduced some false positives, it represents a reasonable and valuable trade-off in fraud detection, where catching more frauds is often prioritized over perfect precision
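Beyond resampling, the same trade-off can be steered through the decision threshold. A minimal sketch on synthetic data (not the project's dataset): lowering the threshold flags more transactions as fraud, which can only increase recall, usually at the cost of precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data standing in for the real transactions (~5% positives)
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                           random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, pred, zero_division=0):.3f} "
          f"recall={recall_score(y, pred):.3f}")
```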
🌳 Random Forest¶
We begin with a baseline Random Forest model trained on the original, highly imbalanced dataset. At this stage, the model relies on its inherent ability to handle imbalance through bootstrap aggregation and random feature selection, but no external balancing technique is applied.
Afterwards, we will perform hyperparameter optimization using GridSearchCV to identify the most effective combination of parameters (such as tree depth, number of estimators, and class weights) for improving detection performance.
Finally, we compare the optimized model's results to the baseline, focusing on the improvements of the model metrics
Baseline Training¶
Random Forest without SMOTE:
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
rf = RandomForestClassifier(
n_estimators=200, # more trees for stability
max_depth=None, # let trees grow fully
min_samples_split=5, # prevent overfitting on minority class
min_samples_leaf=2, # same reason
class_weight="balanced_subsample",# handle imbalance automatically
random_state=42,
n_jobs=-1 # use all CPU cores
)
steps = [("preprocess", preprocess),
("rf", rf)]
pipe = Pipeline(steps)
# Fit model
pipe.fit(X_train, y_train)
# Predict
y_pred_rf = pipe.predict(X_test)
y_proba_rf = pipe.predict_proba(X_test)[:, 1]
Random Forest with SMOTE:
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
rf = RandomForestClassifier(
n_estimators=200, # more trees for stability
max_depth=None, # let trees grow fully
min_samples_split=5, # prevent overfitting on minority class
min_samples_leaf=2, # same reason
class_weight="balanced_subsample",# handle imbalance automatically
random_state=42,
n_jobs=-1 # use all CPU cores
)
smote = SMOTE(
sampling_strategy=0.2, # minority class will be 20% of majority
random_state=42,
k_neighbors=5
)
steps = [("preprocess", preprocess),
("smote", smote),
("rf", rf)]
pipe = Pipeline(steps)
# Fit model
pipe.fit(X_train, y_train)
# Predict
y_pred_rf_smote = pipe.predict(X_test)
y_proba_rf_smote = pipe.predict_proba(X_test)[:, 1]
# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf_smote))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf_smote))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_rf_smote))
Confusion Matrix:
[[553296 278]
[ 1997 148]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 553574
1 0.35 0.07 0.12 2145
accuracy 1.00 555719
macro avg 0.67 0.53 0.56 555719
weighted avg 0.99 1.00 0.99 555719
ROC-AUC Score: 0.8486340518522304
Baseline Results¶
plot_model_performance(y_test, y_pred_rf, y_proba_rf, model_name="Random Forest")
plot_model_performance(y_test, y_pred_rf_smote, y_proba_rf_smote, model_name="Random Forest (SMOTE)")
compare_two_models(
y_true=y_test,
preds_1=y_pred_rf,
probs_1=y_proba_rf,
preds_2=y_pred_rf_smote,
probs_2=y_proba_rf_smote,
model_names=('Baseline Random Forest', 'Random Forest + SMOTE')
)
| Metric | Baseline Random Forest | Random Forest + SMOTE |
|---|---|---|
| Precision | 0.47 | 0.35 |
| Recall | 0.07 | 0.07 |
| F1-score | 0.12 | 0.12 |
| ROC-AUC | 0.84 | 0.85 |
Model Performance
From the results above, we can see that both models achieve nearly identical ROC-AUC scores (0.84 vs 0.85), indicating that their overall ability to distinguish between fraudulent and legitimate transactions remains virtually unchanged. This suggests that applying SMOTE did not significantly affect the model's ranking capability.
However, precision decreased from 0.47 to 0.35 after applying SMOTE, implying that a higher proportion of predicted frauds are now false positives.
Meanwhile, recall remained constant at 0.07, showing that the model still detects only a small fraction of actual fraud cases despite the resampling.
As a result, the F1-score also remained stable at 0.12, reflecting no substantial improvement in the trade-off between precision and recall.
In summary, introducing SMOTE did not enhance fraud detection performance for the Random Forest model in this configuration. Although it slightly reduced precision, it failed to improve recall or overall discriminatory power.
These findings suggest that further optimization, such as hyperparameter tuning, may be required to achieve meaningful gains in detecting fraudulent activity
Optimizing Parameters¶
Let us now apply GridSearchCV to tune the Random Forest parameters and select the combination that optimizes recall, aiming to improve fraud detection sensitivity
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
# Parameter grid (kept small to avoid runtime disconnection)
param_grid = {
"rf__n_estimators": [100],
"rf__max_depth": [10, 20], # 2 levels to test underfitting vs overfitting
"rf__min_samples_split": [5],
"rf__min_samples_leaf": [2],
"rf__class_weight": ["balanced_subsample"] # handle imbalance
}
smote = SMOTE(
sampling_strategy=0.2, # minority class will be 20% of majority
random_state=42,
k_neighbors=5
)
# Build pipeline
steps = [
("preprocess", preprocess),
("smote", smote),
("rf", RandomForestClassifier(random_state=42))
]
pipe = Pipeline(steps)
# Scoring & CV
recall_scorer = make_scorer(recall_score, pos_label=1)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
# GridSearchCV
grid_search = GridSearchCV(
estimator=pipe,
param_grid=param_grid,
scoring=recall_scorer,
cv=cv,
n_jobs=-2,
verbose=1,
return_train_score=False
)
# Fit grid search
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Recall: {grid_search.best_score_:.4f}")
# Evaluate best model
best_model = grid_search.best_estimator_
y_pred_rf_best = best_model.predict(X_test)
y_proba_rf_best = best_model.predict_proba(X_test)[:, 1]
Fitting 2 folds for each of 2 candidates, totalling 4 fits
Best Parameters: {'rf__class_weight': 'balanced_subsample', 'rf__max_depth': 10, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}
Best Cross-Validation Recall: 0.8213
Optimized Results¶
compare_two_models(
y_true=y_test,
preds_1=y_pred_rf,
probs_1=y_proba_rf,
preds_2=y_pred_rf_best,
probs_2=y_proba_rf_best,
model_names=('Baseline Random Forest', 'Tuned Random Forest')
)
| Metric | Baseline Random Forest | Tuned Random Forest |
|---|---|---|
| Precision | 0.47 | 0.01 |
| Recall | 0.07 | 0.09 |
| F1-score | 0.12 | 0.02 |
| ROC-AUC | 0.84 | 0.68 |
Model Performance
The comparison between the Baseline Random Forest and the Tuned Random Forest models reveals that hyperparameter optimization did not yield the expected improvements.
Although the tuning process specifically aimed to maximize recall, the results show only a marginal increase, from 0.07 to 0.09, while precision dropped sharply from 0.47 to 0.01, indicating that nearly all predicted fraud cases were false positives. The large gap between the cross-validated recall (0.82) and the test recall (0.09) also suggests the tuned model may have overfit the resampled training folds rather than learning patterns that generalize to the held-out test set.
Furthermore, the ROC-AUC score declined from 0.84 to 0.68, suggesting that the tuned model lost much of its ability to effectively distinguish between fraudulent and legitimate transactions.
The F1-Score also decreased from 0.12 to 0.02, confirming a weaker overall balance between precision and recall.
In summary, despite focusing on improving recall, the tuning process failed to enhance the Random Forest's performance. Both versions of the model struggle to capture the subtle and complex patterns that characterize fraud. These findings suggest that Random Forest may not be the most suitable algorithm for this task, and that more powerful models, such as XGBoost or neural networks, may be necessary to achieve meaningful predictive performance
⚡XGBoost¶
After observing the limitations of Logistic Regression and Random Forest, we now turn to XGBoost, a model well-known for its robustness in handling complex, imbalanced datasets. Its gradient boosting framework allows it to learn subtle nonlinear patterns that simpler models often miss, which is a crucial advantage in detecting rare fraud cases.
Our goal is to assess whether XGBoost can improve recall without severely compromising precision, effectively capturing more fraudulent transactions while maintaining a reasonable false positive rate.
We begin with a baseline configuration to establish reference performance, followed by targeted hyperparamter tuning (adjusting tree depth, learning rate, and class weights) to enhance its sensitivity to fraud detection.
Baseline Training¶
The parameters below are chosen to balance learning stability, model complexity, and recall sensitivity:
- `n_estimators=900`: A relatively high number of trees allows the model to learn gradually and capture subtle fraud patterns, especially when combined with a low learning rate.
- `learning_rate=0.03`: A small learning rate slows down training and prevents overfitting, helping the model generalize better on unseen transactions.
- `max_depth=6`: Medium-depth trees strike a good balance: deep enough to model complex relationships, but not so deep that they memorize noise.
- `min_child_weight=2`: Requires at least a small number of samples in each leaf, which makes the model less likely to overfit to extremely rare or noisy cases.
- `subsample=0.8` and `colsample_bytree=1.0`: Row subsampling (80%) introduces randomness and improves robustness, while using all features per tree helps capture every relevant signal in the relatively small feature space.
- `gamma=0.1`: A light regularization term that prunes splits with minimal gain, keeping the model compact and efficient.
- `reg_lambda=1` and `reg_alpha=0.1`: L2 and L1 regularization terms that prevent overfitting by penalizing overly complex trees while maintaining flexibility to learn important interactions.
- `scale_pos_weight=(majority/minority)`: Adjusts the loss to give more importance to fraudulent samples.
- `tree_method="hist"`: Uses a histogram-based algorithm that is optimized for large datasets, making training much faster without losing accuracy.
- `eval_metric="aucpr"`: Precision-Recall AUC is more informative than ROC-AUC for imbalanced datasets, as it focuses on how well the model identifies frauds rather than just overall separation.
XGBoost baseline without SMOTE:
# Split features and target
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
xgb = XGBClassifier(
n_estimators=900,
learning_rate=0.03,
max_depth=6,
min_child_weight=2,
subsample=0.8,
colsample_bytree=1.0,
gamma=0.1,
reg_lambda=1,
reg_alpha=0.1,
scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]),
n_jobs=-1,
random_state=42,
tree_method="hist",
eval_metric="aucpr" # better than ROC-AUC for imbalanced datasets
)
# Create pipeline
steps = [("preprocess", preprocess),
("xgb", xgb)]
pipe = Pipeline(steps)
# Fit model
pipe.fit(X_train, y_train)
# Predict
y_pred_xgb = pipe.predict(X_test)
y_proba_xgb = pipe.predict_proba(X_test)[:, 1]
# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_xgb))
Confusion Matrix:
[[547546 6028]
[ 1695 450]]
Classification Report:
precision recall f1-score support
0 1.00 0.99 0.99 553574
1 0.07 0.21 0.10 2145
accuracy 0.99 555719
macro avg 0.53 0.60 0.55 555719
weighted avg 0.99 0.99 0.99 555719
ROC-AUC Score: 0.8605525873602048
XGBoost baseline model with SMOTE:
# Split features and target
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
xgb = XGBClassifier(
n_estimators=900,
learning_rate=0.03,
max_depth=6,
min_child_weight=2,
subsample=0.8,
colsample_bytree=1.0,
gamma=0.1,
reg_lambda=1,
reg_alpha=0.1,
scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]),
n_jobs=-1,
random_state=42,
tree_method="hist",
eval_metric="aucpr" # better than ROC-AUC for imbalanced datasets
)
# Create pipeline
steps = [("preprocess", preprocess),
("smote", smote),
("xgb", xgb)]
pipe = Pipeline(steps)
# Fit model
pipe.fit(X_train, y_train)
# Predict
y_pred_xgb_smote = pipe.predict(X_test)
y_proba_xgb_smote = pipe.predict_proba(X_test)[:, 1]
# Evaluate the SMOTE variant
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb_smote))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb_smote))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_xgb_smote))
Baseline Results¶
compare_two_models(
y_true=y_test,
preds_1=y_pred_xgb,
probs_1=y_proba_xgb,
preds_2=y_pred_xgb_smote,
probs_2=y_proba_xgb_smote,
model_names=('XGBoost', 'XGBoost + SMOTE')
)
| Metric | XGBoost | XGBoost + SMOTE |
|---|---|---|
| Precision | 0.07 | 0.07 |
| Recall | 0.21 | 0.27 |
| F1-score | 0.10 | 0.11 |
| ROC-AUC | 0.86 | 0.87 |
Model Performance
The comparison between the baseline XGBoost model and the XGBoost trained with SMOTE shows only marginal improvements.
Applying SMOTE slightly increased recall from 0.21 to 0.27, indicating a modest gain in the model's ability to identify fraudulent transactions. However, precision remained unchanged at 0.07, meaning that most predicted frauds were still false positives.
The F1-score showed a minimal improvement from 0.1 to 0.11, and ROC-AUC increased slightly from 0.86 to 0.87, suggesting a small gain in overall discrimination but not a meaningful step forward in practical detection performance.
Overall, both XGBoost variants struggle to balance precision and recall effectively. Despite its reputation as a high-performing algorithm for structured data, XGBoost underperformed in this setting, failing to capture the rare and complex fraud patterns present in the dataset.
Interestingly, the Logistic Regression model with SMOTE achieved a substantially higher recall, demonstrating that in highly imbalanced problems, simpler models can sometimes outperform more complex ones when paired with appropriate data balancing techniques.
Next Steps
We will try to improve the model's parameters to optimize recall, using GridSearchCV to test and evaluate several parameter combinations.
Optimizing Parameters¶
Let us now apply GridSearchCV to tune the XGBoost parameters and select the combination that optimizes recall, aiming to improve fraud detection sensitivity
# Define parameter grid (relatively small grid to avoid disconnection from colab)
param_grid = {
"xgb__max_depth": [5],
"xgb__min_child_weight": [2, 5],
"xgb__colsample_bytree": [0.8, 1.0],
}
# Recall scorer and CV setup
recall_scorer = make_scorer(recall_score, pos_label=1)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
# GridSearch
grid_search = GridSearchCV(
estimator=pipe,
param_grid=param_grid,
scoring=recall_scorer,
cv=cv,
verbose=2,
n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Extract best model
best_model = grid_search.best_estimator_
print("\nBest Parameters Found:")
for k, v in grid_search.best_params_.items():
print(f" {k}: {v}")
print(f"\nBest Cross-Validated Recall: {grid_search.best_score_:.4f}")
# Evaluate best model on test set
y_pred_xg_best = best_model.predict(X_test)
y_proba_xg_best = best_model.predict_proba(X_test)[:, 1]
# Print evaluation metrics
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_xg_best))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xg_best))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_xg_best):.4f}")
Fitting 2 folds for each of 4 candidates, totalling 8 fits
Best Parameters Found:
xgb__colsample_bytree: 0.8
xgb__max_depth: 5
xgb__min_child_weight: 5
Best Cross-Validated Recall: 0.9484
Confusion Matrix:
[[540925 12649]
[ 1445 700]]
Classification Report:
precision recall f1-score support
0 1.00 0.98 0.99 553574
1 0.05 0.33 0.09 2145
accuracy 0.97 555719
macro avg 0.52 0.65 0.54 555719
weighted avg 0.99 0.97 0.98 555719
ROC-AUC: 0.8671
Optimized Results¶
plot_model_performance(y_test, y_pred_xg_best, y_proba_xg_best, model_name="XGBoost (GridSearch Recall-Optimized)")
compare_two_models(
    y_true=y_test,
    preds_1=y_pred_xg_best,
    probs_1=y_proba_xg_best,
    preds_2=y_pred_xgb_smote,
    probs_2=y_proba_xgb_smote,
    model_names=('Tuned XGBoost', 'XGBoost + SMOTE')
)
| Metric | Tuned XGBoost | XGBoost + SMOTE |
|---|---|---|
| Precision | 0.05 | 0.07 |
| Recall | 0.33 | 0.27 |
| F1-score | 0.09 | 0.11 |
| ROC-AUC | 0.87 | 0.87 |
Model Performance
The tuned XGBoost model demonstrates a noticeable improvement in recall, increasing from 0.27 to 0.33, which indicates that it now identifies a larger proportion of fraudulent transactions.
This gain enhances the model's sensitivity to the minority class. The ROC-AUC remained unchanged at 0.87.
However, the recall gains come with trade-offs: precision dropped slightly to 0.05, and the F1-score dropped from 0.11 to 0.09, revealing that the model still produces many false positives and struggles to balance precision and recall effectively.
When compared with the other models evaluated, Logistic Regression with SMOTE continues to provide the best overall balance between accuracy, stability, and interpretability.
Logistic Regression remains the most reliable and consistent performer for this dataset.
🧠 Neural Network¶
After testing traditional machine learning models, we now move toward deep learning to explore whether a neural network can better capture the nonlinear and hidden relationships that define fraudulent behavior.
Unlike tree-based or linear models, neural networks can learn complex, high-dimensional feature interactions directly from data, a capability that may uncover subtle fraud patterns that previous models overlooked.
To integrate this approach seamlessly into our workflow, we implement a custom PyTorch model wrapper (TorchNNWrapper) that follows scikit-learn's BaseEstimator and ClassifierMixin interfaces. This design allows us to train, evaluate, and compare the neural network just like any other scikit-learn model, preserving a consistent pipeline structure while leveraging PyTorch's flexibility and computational power.
In essence, this step represents an effort to combine the interpretability and structure of our existing pipeline with the expressive power of deep learning, aiming to push beyond the limitations encountered with traditional algorithms.
Network class¶
Neural Network Architecture and Training Details
Our PyTorch neural network, designed for this fraud detection task, employs a multi-layer perceptron (MLP) architecture to capture complex patterns in the data. The network is structured as follows:
Layers: The model consists of three hidden layers with ReLU activation functions:
- The first two layers have 256 neurons each, followed by Batch Normalization and Dropout (with a rate of 0.3).
- The third layer has 128 neurons, also followed by Batch Normalization and Dropout.
- Batch Normalization helps stabilize and accelerate the training process by normalizing the inputs to each layer.
- Dropout acts as a regularization technique by randomly setting a fraction of neurons to zero during training, which helps prevent overfitting, especially important for handling the imbalance and diversity of the dataset.
- The final layer is a single output neuron with no activation function, producing a raw logit score for binary classification.
Loss Function: We use torch.nn.BCEWithLogitsLoss. This loss function is well-suited for binary classification because it combines a sigmoid layer and binary cross-entropy in a single, numerically stable function. Crucially, it allows us to apply class weights directly via the pos_weight parameter, addressing the severe class imbalance by giving more importance to the minority (fraudulent) class during training.
Optimizer: The Adam optimizer is used with a learning rate of 5e-5 (the value configured in the final pipeline). Adam is an adaptive learning-rate optimization algorithm widely used for training deep neural networks; its efficiency and effectiveness on large datasets make it a suitable choice for this problem.
Hyperparameter Selection: The architecture (number of layers, neurons per layer) and hyperparameters (learning rate, batch size, dropout rate) were determined through empirical experimentation. In a production setting, a more rigorous approach, such as cross-validation combined with hyperparameter-tuning libraries, would be employed to systematically search for the optimal configuration.
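To make the weighting concrete, the per-sample quantity that BCEWithLogitsLoss computes can be reproduced in a few lines of pure Python. This is a sketch of the standard numerically stable formula, not PyTorch's actual implementation:

```python
import math

def weighted_bce_with_logits(logit, target, pos_weight=1.0):
    """Per-sample binary cross-entropy on a raw logit, with the positive
    class scaled by pos_weight (mirrors torch.nn.BCEWithLogitsLoss)."""
    # Numerically stable log-sigmoid: log s(x) = -log(1 + e^{-|x|}) - max(-x, 0)
    log_sig = -math.log1p(math.exp(-abs(logit))) - max(-logit, 0.0)
    log_one_minus_sig = -logit + log_sig  # log(1 - s(x)) = log s(-x)
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)

# A fraud sample (target=1) at logit 0 costs log 2 ~ 0.693 unweighted,
# but 10x that with pos_weight=10, so missed frauds dominate the gradient.
print(weighted_bce_with_logits(0.0, 1))                 # ~ 0.6931
print(weighted_bce_with_logits(0.0, 1, pos_weight=10))  # ~ 6.9315
```

This is why pos_weight shifts the precision/recall balance: the gradient contribution of each missed fraud is multiplied by the weight, while legitimate transactions are unaffected.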
Validation and Early Stopping
To prevent overfitting and improve generalization, we hold out a validation subset (20%, via val_split=0.2) from the training data during each training run. After every epoch, the model's performance is evaluated on this validation set.
If the validation loss does not improve for a predefined number of epochs (early_stopping_patience=20), training stops automatically.
This ensures the model retains the weights from the epoch with the best validation performance, preventing unnecessary training and reducing the risk of overfitting to the training data.
class TorchNNWrapper(BaseEstimator, ClassifierMixin):
def __init__(self,
input_dim=None,
lr=1e-4,
batch_size=2048,
epochs=100,
dropout=0.3,
threshold = 0.5,
class_weight=None,
early_stopping_patience=20,
val_split=0.2,
device=None,
verbose=True):
self.input_dim = input_dim
self.lr = lr
self.batch_size = batch_size
self.epochs = epochs
self.dropout = dropout
self.threshold = threshold
self.class_weight = class_weight
self.early_stopping_patience = early_stopping_patience
self.val_split = val_split
self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
self.verbose = verbose
self.model_ = None
self.train_losses_ = []
self.val_losses_ = []
# Define NN architecture
def _build_model(self):
model = nn.Sequential(
# First layer
nn.Linear(self.input_dim, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(self.dropout),
# Second layer
nn.Linear(256, 256),
nn.BatchNorm1d(256),
nn.ReLU(),
nn.Dropout(self.dropout),
# Third layer
nn.Linear(256, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(self.dropout),
nn.Linear(128, 1)
)
return model.to(self.device)
# Training
def fit(self, X, y):
# Convert to tensors
X = torch.tensor(np.asarray(X), dtype=torch.float32)
y = torch.tensor(np.asarray(y).reshape(-1, 1), dtype=torch.float32)
# Split into training and validation
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=self.val_split, stratify=y.cpu(), random_state=42
)
train_loader = DataLoader(TensorDataset(X_train, y_train),
batch_size=self.batch_size, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val),
batch_size=self.batch_size, shuffle=False)
# Build model and loss
self.input_dim = X.shape[1]
self.model_ = self._build_model()
if self.class_weight is not None:
weight_tensor = torch.tensor([self.class_weight[1]], dtype=torch.float32).to(self.device)
criterion = nn.BCEWithLogitsLoss(pos_weight=weight_tensor)
else:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(self.model_.parameters(), lr=self.lr)
# Early stopping setup
best_val_loss = float('inf')
best_state_dict = None
patience_counter = 0
self.train_losses_ = []
self.val_losses_ = []
# Training loop
for epoch in range(self.epochs):
self.model_.train()
running_loss = 0.0
for xb, yb in train_loader:
xb, yb = xb.to(self.device), yb.to(self.device)
optimizer.zero_grad()
outputs = self.model_(xb)
loss = criterion(outputs.view(-1), yb.view(-1))
loss.backward()
optimizer.step()
running_loss += loss.item()
avg_train_loss = running_loss / len(train_loader)
self.train_losses_.append(avg_train_loss)
# Validation
self.model_.eval()
val_loss = 0.0
with torch.no_grad():
for xb, yb in val_loader:
xb, yb = xb.to(self.device), yb.to(self.device)
outputs = self.model_(xb)
loss = criterion(outputs.view(-1), yb.view(-1))
val_loss += loss.item()
avg_val_loss = val_loss / len(val_loader)
self.val_losses_.append(avg_val_loss)
# Print progress
if self.verbose:
print(f"Epoch {epoch+1}/{self.epochs} | Train Loss: {avg_train_loss:.4f} | Current Val Loss: {avg_val_loss:.4f} | Best Val Loss: {best_val_loss:.4f}")
# Early stopping logic
if avg_val_loss < best_val_loss - 1e-4:
best_val_loss = avg_val_loss
best_state_dict = {k: v.cpu().clone() for k, v in self.model_.state_dict().items()}
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= self.early_stopping_patience:
if self.verbose:
print(f"Early stopping at epoch {epoch+1} (no improvement for {self.early_stopping_patience} epochs)")
break
# Restore best weights
if best_state_dict is not None:
self.model_.load_state_dict(best_state_dict)
self.model_.to(self.device)
return self
# Predict
def predict_proba(self, X):
self.model_.eval()
X = torch.tensor(np.asarray(X), dtype=torch.float32).to(self.device)
with torch.no_grad():
logits = self.model_(X)
probs = torch.sigmoid(logits).cpu().numpy().flatten()
return np.vstack([1 - probs, probs]).T
def predict(self, X):
return (self.predict_proba(X)[:, 1] >= self.threshold).astype(int)
To address the severe class imbalance in our dataset, we explored two complementary strategies:
Class weighting - assigning higher penalties to fraudulent transactions during training, with weight ratios ranging from 1 to 100. This approach aimed to make the neural network more sensitive to missed fraud cases.
SMOTE - synthetically oversampling the minority class. The best results were achieved when fraudulent transactions comprised approximately 20% of the resampled training data.
Ultimately, SMOTE alone (without class weighting) provided the most stable and accurate performance, striking a better balance between recall and overall model reliability, as shown later in the analysis.
We will now integrate the neural network into the pipeline we've previously created:
# Combine into Pipeline
pipe_nn = Pipeline([
("preprocess", preprocess),
("smote", smote),
("nn", TorchNNWrapper(
epochs=100,
batch_size=4096,
lr=5e-5
))
])
Training¶
# Train & Evaluate
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
pipe_nn.fit(X_train, y_train)
y_pred_nn = pipe_nn.predict(X_test)
y_proba_nn = pipe_nn.predict_proba(X_test)
print(classification_report(y_test, y_pred_nn))
print("ROC AUC:", roc_auc_score(y_test, y_proba_nn[:, 1]))
Epoch 1/100 | Train Loss: 0.5896 | Current Val Loss: 0.4554 | Best Val Loss: inf
Epoch 2/100 | Train Loss: 0.4084 | Current Val Loss: 0.3261 | Best Val Loss: 0.4554
Epoch 3/100 | Train Loss: 0.3298 | Current Val Loss: 0.2876 | Best Val Loss: 0.3261
Epoch 4/100 | Train Loss: 0.2980 | Current Val Loss: 0.2711 | Best Val Loss: 0.2876
Epoch 5/100 | Train Loss: 0.2807 | Current Val Loss: 0.2550 | Best Val Loss: 0.2711
Epoch 6/100 | Train Loss: 0.2693 | Current Val Loss: 0.2490 | Best Val Loss: 0.2550
Epoch 7/100 | Train Loss: 0.2606 | Current Val Loss: 0.2390 | Best Val Loss: 0.2490
Epoch 8/100 | Train Loss: 0.2527 | Current Val Loss: 0.2334 | Best Val Loss: 0.2390
Epoch 9/100 | Train Loss: 0.2450 | Current Val Loss: 0.2308 | Best Val Loss: 0.2334
Epoch 10/100 | Train Loss: 0.2389 | Current Val Loss: 0.2402 | Best Val Loss: 0.2308
Epoch 11/100 | Train Loss: 0.2329 | Current Val Loss: 0.2207 | Best Val Loss: 0.2308
Epoch 12/100 | Train Loss: 0.2276 | Current Val Loss: 0.2109 | Best Val Loss: 0.2207
Epoch 13/100 | Train Loss: 0.2226 | Current Val Loss: 0.1981 | Best Val Loss: 0.2109
Epoch 14/100 | Train Loss: 0.2183 | Current Val Loss: 0.2048 | Best Val Loss: 0.1981
Epoch 15/100 | Train Loss: 0.2139 | Current Val Loss: 0.1983 | Best Val Loss: 0.1981
Epoch 16/100 | Train Loss: 0.2104 | Current Val Loss: 0.1876 | Best Val Loss: 0.1981
Epoch 17/100 | Train Loss: 0.2068 | Current Val Loss: 0.1902 | Best Val Loss: 0.1876
Epoch 18/100 | Train Loss: 0.2031 | Current Val Loss: 0.1833 | Best Val Loss: 0.1876
Epoch 19/100 | Train Loss: 0.2002 | Current Val Loss: 0.1767 | Best Val Loss: 0.1833
Epoch 20/100 | Train Loss: 0.1960 | Current Val Loss: 0.1960 | Best Val Loss: 0.1767
Epoch 21/100 | Train Loss: 0.1936 | Current Val Loss: 0.1783 | Best Val Loss: 0.1767
Epoch 22/100 | Train Loss: 0.1903 | Current Val Loss: 0.1635 | Best Val Loss: 0.1767
Epoch 23/100 | Train Loss: 0.1874 | Current Val Loss: 0.1642 | Best Val Loss: 0.1635
Epoch 24/100 | Train Loss: 0.1838 | Current Val Loss: 0.1582 | Best Val Loss: 0.1635
Epoch 25/100 | Train Loss: 0.1805 | Current Val Loss: 0.1596 | Best Val Loss: 0.1582
Epoch 26/100 | Train Loss: 0.1779 | Current Val Loss: 0.1519 | Best Val Loss: 0.1582
Epoch 27/100 | Train Loss: 0.1743 | Current Val Loss: 0.1543 | Best Val Loss: 0.1519
Epoch 28/100 | Train Loss: 0.1709 | Current Val Loss: 0.1516 | Best Val Loss: 0.1519
Epoch 29/100 | Train Loss: 0.1679 | Current Val Loss: 0.1445 | Best Val Loss: 0.1516
Epoch 30/100 | Train Loss: 0.1637 | Current Val Loss: 0.1432 | Best Val Loss: 0.1445
Epoch 31/100 | Train Loss: 0.1603 | Current Val Loss: 0.1304 | Best Val Loss: 0.1432
Epoch 32/100 | Train Loss: 0.1563 | Current Val Loss: 0.1608 | Best Val Loss: 0.1304
Epoch 33/100 | Train Loss: 0.1522 | Current Val Loss: 0.1392 | Best Val Loss: 0.1304
Epoch 34/100 | Train Loss: 0.1475 | Current Val Loss: 0.1323 | Best Val Loss: 0.1304
Epoch 35/100 | Train Loss: 0.1414 | Current Val Loss: 0.1302 | Best Val Loss: 0.1304
Epoch 36/100 | Train Loss: 0.1360 | Current Val Loss: 0.1579 | Best Val Loss: 0.1302
Epoch 37/100 | Train Loss: 0.1332 | Current Val Loss: 0.1205 | Best Val Loss: 0.1302
Epoch 38/100 | Train Loss: 0.1273 | Current Val Loss: 0.2724 | Best Val Loss: 0.1205
Epoch 39/100 | Train Loss: 0.1211 | Current Val Loss: 0.1180 | Best Val Loss: 0.1205
Epoch 40/100 | Train Loss: 0.1190 | Current Val Loss: 0.1208 | Best Val Loss: 0.1180
Epoch 41/100 | Train Loss: 0.1174 | Current Val Loss: 0.1260 | Best Val Loss: 0.1180
Epoch 42/100 | Train Loss: 0.1112 | Current Val Loss: 0.0932 | Best Val Loss: 0.1180
Epoch 43/100 | Train Loss: 0.1141 | Current Val Loss: 0.1052 | Best Val Loss: 0.0932
Epoch 44/100 | Train Loss: 0.1092 | Current Val Loss: 0.1994 | Best Val Loss: 0.0932
Epoch 45/100 | Train Loss: 0.1060 | Current Val Loss: 0.1578 | Best Val Loss: 0.0932
Epoch 46/100 | Train Loss: 0.1057 | Current Val Loss: 0.0804 | Best Val Loss: 0.0932
Epoch 47/100 | Train Loss: 0.1039 | Current Val Loss: 0.0831 | Best Val Loss: 0.0804
Epoch 48/100 | Train Loss: 0.1021 | Current Val Loss: 0.1935 | Best Val Loss: 0.0804
Epoch 49/100 | Train Loss: 0.1009 | Current Val Loss: 0.1745 | Best Val Loss: 0.0804
Epoch 50/100 | Train Loss: 0.1005 | Current Val Loss: 0.0863 | Best Val Loss: 0.0804
Epoch 51/100 | Train Loss: 0.0959 | Current Val Loss: 0.0768 | Best Val Loss: 0.0804
Epoch 52/100 | Train Loss: 0.0968 | Current Val Loss: 0.0855 | Best Val Loss: 0.0768
Epoch 53/100 | Train Loss: 0.0938 | Current Val Loss: 0.4512 | Best Val Loss: 0.0768
Epoch 54/100 | Train Loss: 0.0917 | Current Val Loss: 0.1038 | Best Val Loss: 0.0768
Epoch 55/100 | Train Loss: 0.0908 | Current Val Loss: 0.0819 | Best Val Loss: 0.0768
Epoch 56/100 | Train Loss: 0.0898 | Current Val Loss: 0.2161 | Best Val Loss: 0.0768
Epoch 57/100 | Train Loss: 0.0891 | Current Val Loss: 0.0800 | Best Val Loss: 0.0768
Epoch 58/100 | Train Loss: 0.0872 | Current Val Loss: 0.0712 | Best Val Loss: 0.0768
Epoch 59/100 | Train Loss: 0.0866 | Current Val Loss: 0.2831 | Best Val Loss: 0.0712
Epoch 60/100 | Train Loss: 0.0849 | Current Val Loss: 0.0833 | Best Val Loss: 0.0712
Epoch 61/100 | Train Loss: 0.0841 | Current Val Loss: 0.2082 | Best Val Loss: 0.0712
Epoch 62/100 | Train Loss: 0.0838 | Current Val Loss: 0.1042 | Best Val Loss: 0.0712
Epoch 63/100 | Train Loss: 0.0836 | Current Val Loss: 0.0925 | Best Val Loss: 0.0712
Epoch 64/100 | Train Loss: 0.0832 | Current Val Loss: 0.2715 | Best Val Loss: 0.0712
Epoch 65/100 | Train Loss: 0.0796 | Current Val Loss: 0.1622 | Best Val Loss: 0.0712
Epoch 66/100 | Train Loss: 0.0803 | Current Val Loss: 0.0660 | Best Val Loss: 0.0712
Epoch 67/100 | Train Loss: 0.0791 | Current Val Loss: 0.0677 | Best Val Loss: 0.0660
Epoch 68/100 | Train Loss: 0.0817 | Current Val Loss: 0.1238 | Best Val Loss: 0.0660
Epoch 69/100 | Train Loss: 0.0765 | Current Val Loss: 0.1383 | Best Val Loss: 0.0660
Epoch 70/100 | Train Loss: 0.0769 | Current Val Loss: 0.2001 | Best Val Loss: 0.0660
Epoch 71/100 | Train Loss: 0.0775 | Current Val Loss: 0.0656 | Best Val Loss: 0.0660
Epoch 72/100 | Train Loss: 0.0755 | Current Val Loss: 0.1318 | Best Val Loss: 0.0656
Epoch 73/100 | Train Loss: 0.0742 | Current Val Loss: 0.0659 | Best Val Loss: 0.0656
Epoch 74/100 | Train Loss: 0.0765 | Current Val Loss: 0.1926 | Best Val Loss: 0.0656
Epoch 75/100 | Train Loss: 0.0735 | Current Val Loss: 0.3957 | Best Val Loss: 0.0656
Epoch 76/100 | Train Loss: 0.0795 | Current Val Loss: 0.0933 | Best Val Loss: 0.0656
Epoch 77/100 | Train Loss: 0.0713 | Current Val Loss: 0.0607 | Best Val Loss: 0.0656
Epoch 78/100 | Train Loss: 0.0741 | Current Val Loss: 0.0709 | Best Val Loss: 0.0607
Epoch 79/100 | Train Loss: 0.0718 | Current Val Loss: 0.0815 | Best Val Loss: 0.0607
Epoch 80/100 | Train Loss: 0.0725 | Current Val Loss: 0.0732 | Best Val Loss: 0.0607
Epoch 81/100 | Train Loss: 0.0708 | Current Val Loss: 0.0881 | Best Val Loss: 0.0607
Epoch 82/100 | Train Loss: 0.0682 | Current Val Loss: 0.1026 | Best Val Loss: 0.0607
Epoch 83/100 | Train Loss: 0.0680 | Current Val Loss: 0.2170 | Best Val Loss: 0.0607
Epoch 84/100 | Train Loss: 0.0660 | Current Val Loss: 0.1474 | Best Val Loss: 0.0607
Epoch 85/100 | Train Loss: 0.0668 | Current Val Loss: 0.1100 | Best Val Loss: 0.0607
Epoch 86/100 | Train Loss: 0.0687 | Current Val Loss: 0.0699 | Best Val Loss: 0.0607
Epoch 87/100 | Train Loss: 0.0662 | Current Val Loss: 0.0456 | Best Val Loss: 0.0607
Epoch 88/100 | Train Loss: 0.0669 | Current Val Loss: 0.1338 | Best Val Loss: 0.0456
Epoch 89/100 | Train Loss: 0.0707 | Current Val Loss: 0.0896 | Best Val Loss: 0.0456
Epoch 90/100 | Train Loss: 0.0670 | Current Val Loss: 0.0803 | Best Val Loss: 0.0456
Epoch 91/100 | Train Loss: 0.0686 | Current Val Loss: 0.0438 | Best Val Loss: 0.0456
Epoch 92/100 | Train Loss: 0.0638 | Current Val Loss: 0.3370 | Best Val Loss: 0.0438
Epoch 93/100 | Train Loss: 0.0662 | Current Val Loss: 0.0708 | Best Val Loss: 0.0438
Epoch 94/100 | Train Loss: 0.0654 | Current Val Loss: 0.4103 | Best Val Loss: 0.0438
Epoch 95/100 | Train Loss: 0.0627 | Current Val Loss: 0.0547 | Best Val Loss: 0.0438
Epoch 96/100 | Train Loss: 0.0620 | Current Val Loss: 0.2048 | Best Val Loss: 0.0438
Epoch 97/100 | Train Loss: 0.0606 | Current Val Loss: 0.0500 | Best Val Loss: 0.0438
Epoch 98/100 | Train Loss: 0.0629 | Current Val Loss: 0.0765 | Best Val Loss: 0.0438
Epoch 99/100 | Train Loss: 0.0622 | Current Val Loss: 0.0790 | Best Val Loss: 0.0438
Epoch 100/100 | Train Loss: 0.0599 | Current Val Loss: 0.1165 | Best Val Loss: 0.0438
precision recall f1-score support
0 1.00 1.00 1.00 553574
1 0.45 0.63 0.52 2145
accuracy 1.00 555719
macro avg 0.72 0.81 0.76 555719
weighted avg 1.00 1.00 1.00 555719
ROC AUC: 0.9695072674726705
💡 Note: Predictions were made using the weights restored from the epoch with the lowest validation loss (epoch 91, val loss 0.0438); with patience = 20, early stopping was never triggered and training ran the full 100 epochs
# Access the trained neural network model from the pipeline
nn_model = pipe_nn.named_steps['nn']
# Plot training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(nn_model.train_losses_, label='Training Loss')
plt.plot(nn_model.val_losses_, label='Validation Loss')
plt.title('Training and Validation Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
Training Analysis
The chart illustrates the evolution of training and validation loss across all 100 epochs, with early stopping armed (patience = 20) but never triggered.
During the initial phase (epochs 0-20), both losses decrease sharply, reflecting rapid learning and effective convergence of the model.
In the middle phase (epochs 20-50), training loss continues to decline steadily, while validation loss fluctuates mildly but remains generally consistent - suggesting that the model generalizes well at this stage.
In the final phase (epochs 50-100), validation loss becomes increasingly unstable, showing sharp oscillations while training loss stays low - a clear sign of emerging overfitting.
The restore-best-weights mechanism preserved the model from the epoch with the lowest validation loss (epoch 91), maintaining the best balance between model fit and generalization performance on unseen data.
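The restore-best-weights step reduces to an argmin over the recorded validation losses. A minimal sketch using a small illustrative subset of values from the training log above (epochs 87-94):

```python
# Tail of val_losses_ from the log above (epochs 87-94, illustrative subset)
val_losses = [0.0456, 0.1338, 0.0896, 0.0803, 0.0438, 0.3370, 0.0708, 0.4103]

# 0-based index of the epoch whose weights are restored after training
best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
print(best_idx, val_losses[best_idx])  # 4 0.0438
```

In the wrapper this is done incrementally: best_state_dict is snapshotted whenever the current validation loss beats the best seen so far, so no second pass over the history is needed.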
Results¶
To evaluate the model's performance, we first define a helper function, visualize_model_performance(), which prints the metrics and generates a graphical comparison of the results.
def visualize_model_performance(precision, recall, f1, roc, model_name="Model"):
"""
Display Precision, Recall, F1-score, and ROC-AUC for a single model
and visualize results as a bar chart.
"""
# Metrics and values
metrics = ['Precision', 'Recall', 'F1-score', 'ROC-AUC']
scores = [precision, recall, f1, roc]
# Print table
results_df = pd.DataFrame({
'Metric': metrics,
model_name: np.round(scores, 3)
})
print(results_df.to_string(index=False))
print()
# Bar chart
x = np.arange(len(metrics))
fig, ax = plt.subplots(figsize=(7, 5))
bars = ax.bar(x, scores, color='#4C72B0', width=0.6)
# Annotate bars
for bar in bars:
ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=9)
# Aesthetics
ax.set_ylabel('Score')
ax.set_ylim(0, 1.05)
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_title(f'{model_name}: Performance Metrics')
ax.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
visualize_model_performance(recall=0.64, precision=0.28, f1=0.39, roc=0.95543, model_name="Neural Network")
   Metric  Neural Network
Precision            0.28
   Recall            0.64
 F1-score            0.39
  ROC-AUC            0.95
Model Performance
The Neural Network model delivers a notable leap in performance across all key metrics, confirming its ability to capture complex, non-linear patterns in the data.
With a recall of 0.64, the model successfully detects nearly two-thirds of all fraudulent transactions, which is a substantial improvement over the previous models. Although precision (0.28) remains moderate, this trade-off is often acceptable in fraud detection, where minimizing missed frauds (high recall) is far more critical than avoiding every false alarm.
The F1-score of 0.39 demonstrates a balanced compromise between precision and recall, highlighting the model's stronger overall detection capability.
Moreover, the ROC-AUC of 0.96 indicates excellent class separability, showing that the network effectively distinguishes between fraudulent and legitimate transactions based on its predicted probabilities.
Overall, this Neural Network is the first model to surpass the Logistic Regression baseline, achieving higher recall and superior discriminative power. It stands out as the most effective approach so far for this dataset, combining robust learning capacity with meaningful real-world applicability in fraud detection.
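ROC-AUC also has a useful probabilistic reading: it is the chance that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. A toy sketch (brute force over score pairs; fine for illustration, far too slow for half a million rows):

```python
def roc_auc(pos_scores, neg_scores):
    """P(score of a random positive > score of a random negative); ties count half."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Perfect separation gives 1.0; each inversion lowers the score proportionally
print(roc_auc([0.9, 0.8], [0.1, 0.2]))   # 1.0
print(roc_auc([0.9, 0.15], [0.1, 0.2]))  # 0.75
```

This interpretation explains why ROC-AUC can stay high (0.96 here) even when precision at the 0.5 threshold is modest: it measures ranking quality, independent of any single threshold.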
Effect of Class Weights on Neural Network Performance:
# Create DataFrame
data = {
"ROC": [0.863, 0.859, 0.864, 0.856, 0.865, 0.859, 0.868, 0.865, 0.859, 0.821, 0.865, 0.853, 0.843, 0.808, 0.854],
"recall": [0.075, 0.076, 0.097, 0.092, 0.307, 0.276, 0.382, 0.487, 0.374, 0.361, 0.367, 0.447, 0.489, 0.312, 0.647],
"precision": [0.976, 0.994, 0.835, 0.399, 0.135, 0.207, 0.095, 0.045, 0.081, 0.044, 0.059, 0.046, 0.038, 0.042, 0.022],
"ratio": [1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 100]
}
df = pd.DataFrame(data)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(df["ratio"], df["ROC"], marker='o', label="ROC-AUC", linewidth=2)
plt.plot(df["ratio"], df["recall"], marker='o', label="Recall", linewidth=2)
plt.plot(df["ratio"], df["precision"], marker='o', label="Precision", linewidth=2)
# Formatting
plt.xscale("log") # log-scale for better visibility of large ratios
plt.xlabel("Minority-to-Majority Class Weight Ratio (log scale)", fontsize=12)
plt.ylabel("Score (0–1)", fontsize=12)
plt.title("Effect of Class Weights on Neural Network Performance", fontsize=14, weight='bold')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.tight_layout()
plt.show()
Effect of Class Weights on Neural Network Performance
💡 Note: the following analysis was conducted using class weights only, without applying SMOTE, to isolate the effect of weighting on model performance
The graph illustrates how varying the minority-to-majority class weight ratio affects the neural network's precision and recall.
As the weight for the minority (fraud) class increases, recall improves, and the model becomes more sensitive to fraud - but precision decreases, resulting in more false positives.
Meanwhile, the ROC-AUC remains largely stable, indicating that the model's overall ability to separate fraud from legitimate transactions is not significantly impacted.
In practice, the optimal class weight depends on the objective:
- To minimize missed frauds, increase the class weight (favor recall)
- To reduce false positives, lower the class weight (favor precision)
This trade-off provides a flexible way to fine-tune the model's behavior according to operational priorities.
🤖 TabNet Classifier¶
To push our fraud detection analysis further, we introduce a deep neural architecture specifically designed for tabular data.
Unlike traditional models that rely on manual feature engineering, TabNet uses sequential attention to dynamically select the most informative features at each decision step. This allows it to learn complex, non-linear relationships while maintaining a degree of interpretability - something rare among deep learning models.
We expect TabNet to outperform previous models by capturing subtle fraud patterns that logistic regression and tree-based methods may overlook. Its built-in handling of feature sparsity and interpretability makes it a promising candidate for highly imbalanced fraud detection tasks.
Training¶
!pip install pytorch-tabnet torch scikit-learn pandas numpy --quiet
from pytorch_tabnet.tab_model import TabNetClassifier
# Ensure GPU availability
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device_name.upper()}")
set_global_seed(42)
Using device: CUDA
Without SMOTE:
# Data Preparation
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
# Preprocess
X_train_proc = preprocess.fit_transform(X_train, y_train)
X_test_proc = preprocess.transform(X_test)
# Training
tabnet = TabNetClassifier(
n_d=32, n_a=32, n_steps=5,
gamma=1.5, lambda_sparse=1e-4,
optimizer_fn=torch.optim.Adam,
optimizer_params=dict(lr=1e-3),
mask_type='entmax',
device_name=device_name
)
tabnet.fit(
X_train_proc, y_train.values,
max_epochs=10, # keep small — loss converges fast
patience=5, # stop after 5 epochs of no improvement
batch_size=2048,
virtual_batch_size=256,
num_workers=0,
drop_last=False
)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
warnings.warn(wrn_msg)
epoch 0  | loss: 0.05758 | 0:00:42s
epoch 1  | loss: 0.0203  | 0:01:22s
epoch 2  | loss: 0.0175  | 0:02:01s
epoch 3  | loss: 0.01536 | 0:02:41s
epoch 4  | loss: 0.01409 | 0:03:20s
epoch 5  | loss: 0.01367 | 0:04:00s
epoch 6  | loss: 0.01259 | 0:04:39s
epoch 7  | loss: 0.01154 | 0:05:19s
epoch 8  | loss: 0.0108  | 0:05:59s
epoch 9  | loss: 0.01017 | 0:06:39s
y_pred_tabnet = tabnet.predict(X_test_proc)
y_proba_tabnet = tabnet.predict_proba(X_test_proc)[:, 1]
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tabnet))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tabnet))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_tabnet))
Confusion Matrix:
[[553439 135]
[ 1472 673]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 553574
1 0.83 0.31 0.46 2145
accuracy 1.00 555719
macro avg 0.92 0.66 0.73 555719
weighted avg 1.00 1.00 1.00 555719
ROC-AUC Score: 0.9499632761462254
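The headline precision and recall for the fraud class can be recomputed directly from the confusion matrix above, which is a useful sanity check on the classification report:

```python
# Confusion matrix from the TabNet (no SMOTE) run: rows = true class, cols = predicted
tn, fp = 553439, 135
fn, tp = 1472, 673

precision = tp / (tp + fp)  # of flagged transactions, how many were fraud
recall = tp / (tp + fn)     # of actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.83 0.31 0.46
```

This makes the trade-off explicit: without resampling, TabNet is very conservative, producing few false alarms (135) but missing most frauds (1472).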
# Visualize Results
plot_model_performance(y_test, y_pred_tabnet, y_proba_tabnet, model_name="TabNet (Without SMOTE)")
With SMOTE:
To ensure a fair comparison with previous models, we extend TabNet with SMOTE oversampling within a custom scikit-learn compatible pipeline.
We tested two oversampling ratios, 0.1 and 0.2, meaning that we trained two different models. SMOTE's sampling_strategy is a minority-to-majority ratio: in the first model the minority class was oversampled to 10% of the majority class size, and to 20% in the second.
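As a concrete check of imblearn's sampling_strategy semantics (the ratio targets the majority-class count, not the dataset size), here is the arithmetic with hypothetical class counts of the same magnitude as our training data:

```python
# Hypothetical class counts (illustrative, not our exact training set)
n_majority, n_minority = 1_000_000, 5_000

for ratio in (0.1, 0.2):
    # After SMOTE, the minority count becomes ratio * majority count
    n_minority_new = int(ratio * n_majority)
    frac_of_total = n_minority_new / (n_majority + n_minority_new)
    print(ratio, n_minority_new, round(frac_of_total, 3))
# 0.1 -> 100000 minority samples, ~9.1% of the resampled data
# 0.2 -> 200000 minority samples, ~16.7% of the resampled data
```

So a ratio of 0.2 yields roughly one fraud per six transactions in the resampled training set, a far cry from the original imbalance, but still short of a 50/50 split.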
# DataFrame Wrapper (preserve feature names)
class DataFrameWrapper(TransformerMixin, BaseEstimator):
"""
Wrap any transformer so its output is returned as a pandas DataFrame
"""
def __init__(self, transformer):
self.transformer = transformer
def fit(self, X, y=None):
self.transformer.fit(X, y)
return self
def transform(self, X):
Xt = self.transformer.transform(X)
# try to preserve feature names
try:
cols = self.transformer.get_feature_names_out()
except Exception:
cols = [f"col_{i}" for i in range(Xt.shape[1])]
return pd.DataFrame(Xt, columns=cols, index=X.index)
class TabNetWrapper(BaseEstimator, ClassifierMixin):
"""
SKlearn-style wrapper around PyTorch TabNet
"""
def __init__(self, **kwargs):
self.model_params = kwargs
self.model_ = None
def fit(self, X, y):
X_np = np.asarray(X, dtype=np.float32)
y_np = np.asarray(y, dtype=np.int64)
self.model_ = TabNetClassifier(**self.model_params)
self.model_.fit(
X_np, y_np,
max_epochs=10,
patience=5,
batch_size=2048,
virtual_batch_size=256,
num_workers=0,
drop_last=False,
)
return self
def predict(self, X):
X_np = np.asarray(X, dtype=np.float32)
return self.model_.predict(X_np)
def predict_proba(self, X):
X_np = np.asarray(X, dtype=np.float32)
return self.model_.predict_proba(X_np)
First model: TabNet with SMOTE ratio 0.1
# Data Preparation (same as before)
# version with SMOTE 0.1
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
# Defining TabNet parameters
tabnet = TabNetWrapper(
n_d=32,
n_a=32,
n_steps=5,
gamma=1.5,
lambda_sparse=1e-4,
optimizer_fn=torch.optim.Adam,
optimizer_params=dict(lr=1e-3),
mask_type='entmax',
device_name='cuda' if torch.cuda.is_available() else 'cpu'
)
# Build full sklearn pipeline
smote = SMOTE(
sampling_strategy=0.1, # minority class will be 10% of majority
random_state=42,
k_neighbors=5
)
steps = [
("preprocess", DataFrameWrapper(preprocess)), # existing ColumnTransformer
("smote", smote),
("tabnet", tabnet)
]
pipe = Pipeline(steps)
# Train and evaluate
pipe.fit(X_train, y_train)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
warnings.warn(wrn_msg)
epoch 0  | loss: 0.14633 | 0:00:44s
epoch 1  | loss: 0.06844 | 0:01:28s
epoch 2  | loss: 0.05102 | 0:02:12s
epoch 3  | loss: 0.04002 | 0:02:57s
epoch 4  | loss: 0.03154 | 0:03:41s
epoch 5  | loss: 0.0278  | 0:04:25s
epoch 6  | loss: 0.02381 | 0:05:10s
epoch 7  | loss: 0.02155 | 0:05:54s
epoch 8  | loss: 0.01969 | 0:06:39s
epoch 9  | loss: 0.01801 | 0:07:23s
Pipeline(steps=[('preprocess',
DataFrameWrapper(transformer=ColumnTransformer(transformers=[('merchant_rate',
FraudRateEncoder(min_samples=100,
smoothing=100),
['merchant']),
('job_rate',
FraudRateEncoder(min_samples=100,
smoothing=100),
['job']),
('city_rate',
FraudRateEncoder(min_samples=100,
smoothing=100),
['city']),
('state_rate',
FraudRateEncoder(min_sampl...
'gender']),
('cyclical_time',
CyclicalTimeEncoder(period_map={'day_of_week': 7,
'hour': 24,
'month': 12}),
['hour',
'day_of_week',
'month']),
('scaler',
MinMaxScaler(),
['amt',
'city_pop',
'distance_cardholder_merchant',
'age',
'card_prev_fraud_ratio'])],
verbose_feature_names_out=False))),
('smote', SMOTE(random_state=42, sampling_strategy=0.1)),
('tabnet', TabNetWrapper())])
# Evaluate
y_pred_tabnet_smote01 = pipe.predict(X_test)
y_proba_tabnet_smote01 = pipe.predict_proba(X_test)[:, 1]
plot_model_performance(y_test, y_pred_tabnet_smote01, y_proba_tabnet_smote01, model_name="TabNet SMOTE - 0.1")
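CyclicalTimeEncoder is likewise a custom transformer; given its `period_map` argument, it presumably applies the standard sine/cosine trick for cyclic features. A minimal sketch of that idea (not the notebook's actual implementation):

```python
import math

def cyclical_encode(value, period):
    """Project a cyclic value (e.g. hour in [0, 24)) onto the unit circle,
    so that the end of a cycle sits next to its beginning."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# 23:00 and 00:00 end up close together, unlike with raw hour values:
d_wrap = math.dist(cyclical_encode(23, 24), cyclical_encode(0, 24))
d_far = math.dist(cyclical_encode(12, 24), cyclical_encode(0, 24))
```

This matters for fraud detection because late-night activity is a known risk signal: without the circular encoding, a model would see hour 23 and hour 0 as maximally far apart.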
Next model: TabNet with SMOTE 0.2
# Data Preparation (same as before)
X_train = df_train.drop(columns=["is_fraud"])
y_train = df_train["is_fraud"].astype(int)
X_test = df_test.drop(columns=["is_fraud"])
y_test = df_test["is_fraud"].astype(int)
# Defining TabNet parameters
tabnet = TabNetWrapper(
n_d=32,
n_a=32,
n_steps=5,
gamma=1.5,
lambda_sparse=1e-4,
optimizer_fn=torch.optim.Adam,
optimizer_params=dict(lr=1e-3),
mask_type='entmax',
device_name='cuda' if torch.cuda.is_available() else 'cpu'
)
# Build full sklearn pipeline
smote = SMOTE(
sampling_strategy=0.2, # minority class will be 20% of majority
random_state=42,
k_neighbors=5
)
steps = [
("preprocess", DataFrameWrapper(preprocess)), # existing ColumnTransformer
("smote", smote),
("tabnet", tabnet)
]
pipe = Pipeline(steps)
# Train and evaluate
pipe.fit(X_train, y_train)
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:82: UserWarning: Device used : cuda
warnings.warn(f"Device used : {self.device}")
/usr/local/lib/python3.12/dist-packages/pytorch_tabnet/abstract_model.py:687: UserWarning: No early stopping will be performed, last training weights will be used.
warnings.warn(wrn_msg)
epoch 0 | loss: 0.18298 | 0:00:49s
epoch 1 | loss: 0.0804  | 0:01:37s
epoch 2 | loss: 0.05342 | 0:02:26s
epoch 3 | loss: 0.04007 | 0:03:15s
epoch 4 | loss: 0.03217 | 0:04:03s
epoch 5 | loss: 0.02855 | 0:04:52s
epoch 6 | loss: 0.02505 | 0:05:40s
epoch 7 | loss: 0.02303 | 0:06:28s
epoch 8 | loss: 0.0242  | 0:07:16s
epoch 9 | loss: 0.02155 | 0:08:04s
Pipeline(steps=[('preprocess', DataFrameWrapper(transformer=ColumnTransformer(...))),  # same preprocessing as above
                ('smote', SMOTE(random_state=42, sampling_strategy=0.2)),
                ('tabnet', TabNetWrapper())])
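As a reminder of what `sampling_strategy=0.2` does: imbalanced-learn's SMOTE synthesizes minority samples until the minority-to-majority ratio reaches the requested value, leaving the majority class untouched. A quick back-of-the-envelope helper (the counts below are illustrative, not the actual training-set figures):

```python
def smote_counts(n_majority, n_minority, sampling_strategy):
    """Class sizes after SMOTE: majority unchanged, minority grown
    until minority / majority == sampling_strategy."""
    target_minority = int(round(sampling_strategy * n_majority))
    n_synthetic = max(0, target_minority - n_minority)
    return n_majority, n_minority + n_synthetic

# Illustrative figures only:
maj, mino = smote_counts(1_000_000, 5_000, 0.2)  # -> (1000000, 200000)
```

In other words, at 0.2 the minority class is inflated far beyond its natural frequency, which explains the recall gains and precision losses observed below.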
# Evaluate
y_pred_tabnet_smote = pipe.predict(X_test)
y_proba_tabnet_smote = pipe.predict_proba(X_test)[:, 1]
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tabnet_smote))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tabnet_smote))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba_tabnet_smote))
Confusion Matrix:
[[540084 13490]
[ 619 1526]]
Classification Report:
precision recall f1-score support
0 1.00 0.98 0.99 553574
1 0.10 0.71 0.18 2145
accuracy 0.97 555719
macro avg 0.55 0.84 0.58 555719
weighted avg 1.00 0.97 0.98 555719
ROC-AUC Score: 0.9623903868991246
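The headline scores follow directly from the confusion matrix above (TP = 1526, FP = 13490, FN = 619); as a sanity check:

```python
# Recompute precision, recall, and F1 from the confusion matrix above
tp, fp, fn = 1526, 13490, 619

precision = tp / (tp + fp)                          # ~0.10
recall = tp / (tp + fn)                             # ~0.71
f1 = 2 * precision * recall / (precision + recall)  # ~0.18
```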
# Visualize Results
plot_model_performance(y_test, y_pred_tabnet_smote, y_proba_tabnet_smote, model_name="TabNet SMOTE - 0.2")
Results¶
y_pred_tabnet_smote02 = y_pred_tabnet_smote
y_proba_tabnet_smote02 = y_proba_tabnet_smote
compare_three_models(
y_true=y_test,
preds_1=y_pred_tabnet,
probs_1=y_proba_tabnet,
preds_2=y_pred_tabnet_smote01,
probs_2=y_proba_tabnet_smote01,
preds_3=y_pred_tabnet_smote02,
probs_3=y_proba_tabnet_smote02,
model_names=('TabNet', 'TabNet (SMOTE 0.1)', 'TabNet (SMOTE 0.2)')
)
| Metric | TabNet | TabNet (SMOTE 0.1) | TabNet (SMOTE 0.2) |
|---|---|---|---|
| Precision | 0.83 | 0.55 | 0.10 |
| Recall | 0.31 | 0.64 | 0.71 |
| F1-score | 0.46 | 0.59 | 0.18 |
| ROC-AUC | 0.95 | 0.98 | 0.96 |
Model Performance
Introducing two levels of SMOTE oversampling (0.1 and 0.2) reveals how the balance between fraud sensitivity and prediction accuracy shifts as synthetic samples increase.
The baseline TabNet remains highly precise (0.83) but conservative, identifying only 31% of frauds (recall = 0.31). This depicts a cautious classifier that avoids false positives but misses many frauds.
With SMOTE 0.1, the model becomes more balanced - recall increases substantially to 0.64, while precision decreases moderately to 0.55. The resulting F1-score of 0.59 marks the most effective compromise between precision and recall, supported by a near-perfect ROC-AUC of 0.98.
At SMOTE 0.2, the model becomes highly sensitive (recall peaks at 0.71) but precision collapses to 0.10, indicating many false alarms. The F1-score correspondingly drops to 0.18.
In summary:
If your goal is maximum precision and fewer false positives → choose TabNet (no SMOTE).
If you seek the best overall balance between detecting and correctly classifying fraud → choose TabNet (SMOTE 0.1).
If you prioritize catching as many frauds as possible, even at the cost of high false positives → choose TabNet (SMOTE 0.2).
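Oversampling is not the only knob, incidentally: the same precision/recall trade-off can be explored after training by moving the decision threshold on `predict_proba`, rather than refitting with a different SMOTE ratio. A sketch using scikit-learn's `precision_recall_curve` (the helper name is ours; `pipe`, `y_test`, and the probability arrays are those from the cells above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba):
    """Scan all candidate thresholds and return the one maximizing F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # precision/recall carry one extra trailing entry; drop it to align with thresholds
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

# e.g.:
# threshold, f1 = best_f1_threshold(y_test, y_proba_tabnet_smote)
# y_pred_tuned = (y_proba_tabnet_smote >= threshold).astype(int)
```

Threshold tuning is cheap (no retraining) and, in production, lets the fraud team dial alert volume up or down without touching the model.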
Models Conclusion¶
Model Performance Summary
| Model | Precision | Recall | F1-score | ROC-AUC |
|---|---|---|---|---|
| Logistic Regression | 0.07 | 0.00 | 0.00 | 0.85 |
| Logistic Regression + SMOTE | 0.19 | 0.62 | 0.29 | 0.92 |
| Random Forest | 0.45 | 0.07 | 0.12 | 0.84 |
| Random Forest + SMOTE | 0.35 | 0.07 | 0.12 | 0.85 |
| Random Forest (SMOTE + Tuned) | 0.01 | 0.09 | 0.02 | 0.68 |
| XGBoost | 0.07 | 0.21 | 0.10 | 0.86 |
| XGBoost + SMOTE | 0.07 | 0.27 | 0.11 | 0.87 |
| XGBoost (SMOTE + Tuned) | 0.05 | 0.33 | 0.09 | 0.87 |
| Neural Network | 0.28 | 0.64 | 0.39 | 0.96 |
| TabNet | **0.83** | 0.31 | 0.46 | 0.95 |
| TabNet + SMOTE-0.1 | 0.54 | 0.64 | **0.59** | **0.98** |
| TabNet + SMOTE-0.2 | 0.10 | **0.71** | 0.18 | 0.96 |
The best result across a given column is bolded, and the second best result is underlined.
Conclusion
Across all models evaluated, results highlight a clear evolution from simple linear approaches to advanced deep learning architectures, both in predictive strength and practical usability for fraud detection.
While Logistic Regression initially struggled with extreme class imbalance, applying SMOTE transformed it into a strong and reliable baseline, achieving a solid ROC-AUC of 0.92. This demonstrates that even straightforward models can be highly effective when supported by proper data balancing techniques.
Ensemble methods like Random Forest and XGBoost failed to improve the overall model performance, indicating difficulty in capturing the rare and complex fraud patterns within the data.
The Neural Network achieved a good balance, with strong recall (0.64) and an impressive ROC-AUC of 0.96, showing its ability to model non-linear relationships. However, it was the TabNet family of models that clearly stood out - demonstrating top-tier performance. Particularly, TabNet with SMOTE 0.1 delivered the best overall results, achieving the highest F1-score (0.59) and ROC-AUC (0.98), representing a near-optimal balance between detecting fraud and minimizing false alarms.
From a business standpoint, these findings suggest that advanced tabular deep learning models like TabNet can significantly enhance fraud detection pipelines. When paired with careful oversampling, they maximize the detection of fraudulent transactions without overwhelming analysts with false positives - leading to higher operational efficiency, improved risk management, and reduced financial loss.
Final Project Conclusion¶
This project set out to build a reliable fraud detection system using a range of machine learning and deep learning models - tackling one of the core challenges of applied data science: extreme class imbalance.
From the EDA phase, the dataset proved to be clean and informative, with no missing values, duplicates, or major outliers. The features appeared meaningful and potentially predictive, providing a solid foundation for modeling.
The unsupervised learning experiments (PCA, t-SNE, and K-Means) were an ambitious attempt to uncover hidden patterns, but they did not produce any actionable insights that could significantly improve model performance - an expected outcome given the highly imbalanced nature of the data.
In contrast, the feature engineering process delivered strong results. Careful experimentation - including creative additions such as the is_night feature - demonstrated clear performance gains across multiple models. Similarly, the preprocessing pipeline proved effective: techniques like fraud rate encoding for categorical features meaningfully improved discrimination between fraudulent and legitimate transactions.
Regarding modeling, several models performed well. Logistic Regression provided a surprisingly strong baseline once SMOTE was applied, showing that even simple models can yield competitive results with proper class balancing. The TabNet models emerged as the clear winners, achieving the best trade-off between precision and recall, especially when combined with SMOTE (0.1). Their strong generalization makes them ideal candidates for real-world fraud detection applications.
Overall, this project demonstrates that success in fraud detection is not just about model complexity but about thoughtful preprocessing, feature design, and class balancing - with deep tabular architectures like TabNet showing exceptional promise for future deployment.
Future Work & Recommendations¶
Our exploration into credit card fraud detection underscores the potential of machine learning for tackling complex and highly imbalanced datasets. While our models achieved strong performance, particularly TabNet with SMOTE, the evolving nature of fraud demands continuous refinement.
Model Optimization and Exploration
Apply more advanced hyperparameter tuning (for instance, Bayesian Optimization) for deeper performance improvements.
Explore additional architectures like LightGBM, CatBoost, or even the relatively new T-JEPA to model transaction relationships.
Deep Learning for Temporal Patterns
Leverage RNNs or LSTMs to capture sequential dependencies in transaction data and detect behavioral shifts over time.
Due to computational limits, this remains future work until greater GPU resources become available.
Feature Engineering Enhancements
Create interaction features. For instance:
category x hour: captures how activity in each spending category varies across the hours of the day, adding another layer of behavioral analysis.
merchant x amount: the average transaction amount per merchant, which might help identify potentially suspicious merchants.
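The two proposed features can be prototyped in a few lines of pandas. The column names below match those consumed by the preprocessing pipeline above; the helper and the derived column names are our own suggestions:

```python
import pandas as pd

def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Prototype of the two proposed interaction features."""
    out = df.copy()
    # category x hour: a combined key that a rate/target encoder could consume
    out["category_hour"] = out["category"].astype(str) + "_" + out["hour"].astype(str)
    # merchant x amount: each transaction's amount relative to its merchant's average
    merchant_mean = out.groupby("merchant")["amt"].transform("mean")
    out["amt_vs_merchant_mean"] = out["amt"] / merchant_mean
    return out
```

Note that in a real pipeline the merchant averages should be computed on the training split only (mirroring how FraudRateEncoder is fit), to avoid leaking test-set statistics.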